Develop an AI system capable of translating spoken language in real time within video streams while preserving natural lip synchronization in the target-language video output.
The system should combine automatic speech recognition (ASR), neural machine translation (NMT), text-to-speech (TTS) or voice conversion, and visual lip-sync generation to produce low-latency, high-quality translated video suitable for live or near-live scenarios.
Objectives & Key Tasks
Design and implement a pipeline that performs: speech capture → ASR → translation → speech generation → lip-sync-driven video rendering, minimizing end-to-end latency.
Research and integrate state-of-the-art models for real-time ASR, low-latency NMT, and lip-sync synthesis (e.g., audio-driven facial animation / viseme alignment), and evaluate trade-offs between quality and speed.
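As a rough illustration of the pipeline described above, the stages can be chained with per-stage timing so that the latency contribution of each step is visible. This is only a minimal sketch: the stage functions (`asr`, `translate`, `synthesize`, `render_lip_sync`) and `process_chunk` are hypothetical placeholders that stand in for real models, not part of any specific library.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical placeholder stages. Each one stands in for a real model
# (ASR, NMT, TTS, lip-sync renderer) and only transforms toy data.
def asr(audio_chunk: bytes) -> str:
    return audio_chunk.decode("utf-8")

def translate(text: str) -> str:
    return text.upper()

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")

def render_lip_sync(speech: bytes) -> Dict[str, Any]:
    return {"audio": speech, "frames": len(speech)}

STAGES: List[Tuple[str, Callable[[Any], Any]]] = [
    ("asr", asr),
    ("nmt", translate),
    ("tts", synthesize),
    ("lip_sync", render_lip_sync),
]

def process_chunk(chunk: bytes, stages=STAGES) -> Tuple[Any, Dict[str, float]]:
    """Run one audio chunk through every stage and record per-stage latency."""
    timings: Dict[str, float] = {}
    data: Any = chunk
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    # End-to-end latency is the sum of the sequential stage latencies.
    timings["end_to_end"] = sum(timings.values())
    return data, timings

if __name__ == "__main__":
    output, timings = process_chunk(b"hello")
    print(output, timings)
```

Keeping the stages behind a common callable interface makes it straightforward to swap one model for another and compare the quality and speed trade-off of each choice.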
Technical Requirements & Skills
Strong knowledge of machine learning, especially deep learning frameworks (PyTorch or TensorFlow), and experience with sequence-to-sequence models.
Experience with computer vision (video processing, facial landmark detection), speech and NLP (ASR, NMT, TTS), and real-time inference optimization (quantization, model pruning, batching strategies).
Deliverables & Evaluation
A working prototype demonstrating live or near-real-time translation with synchronized lip movements on output video, plus qualitative and quantitative evaluation (latency, translation accuracy, lip-sync accuracy, perceptual quality).
Documentation, source code, and a short demo video showcasing typical use-cases and performance metrics.
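For the latency part of the quantitative evaluation, one possible way to report results is a small summary of end-to-end latency samples. The sketch below assumes latencies are collected in milliseconds; the function name `summarize_latency` and the choice of percentiles are illustrative assumptions, not requirements of the project.

```python
import statistics
from typing import Dict, Sequence

def summarize_latency(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize end-to-end latency samples (in milliseconds)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; index 49 is the median
    # and index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "max_ms": max(samples_ms),
    }

if __name__ == "__main__":
    print(summarize_latency([100, 200, 300, 400, 500]))
```

Reporting a high percentile alongside the mean matters for live scenarios, because occasional slow chunks are what viewers notice as audio and video drifting apart.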
Tools & Environment
Development in Python with common ML libraries (PyTorch/TensorFlow, OpenCV, Hugging Face Transformers, Kaldi/ESPnet or similar for ASR, TTS toolkits).
Optional deployment targets: desktop GPU, edge device, or cloud inference; include notes on scalability and latency optimization.
How to Apply
To apply for this PFE internship, send your CV, a brief cover letter describing relevant projects, and links to any demos or repositories.
Use the subject line: "PFE Application - 10 Subject 4: AI Model for Real-Time Video Translation with Live Lip Sync" and send to career@olindias.com.