Project Objective:
- Develop a multimodal AI system that automatically generates advanced semantic metadata for videos.
- Achieve deep contextual understanding by jointly analyzing visuals, audio (via speech-to-text), speaker intent, and on-screen text, rather than relying on simple keyword extraction.
Primary Responsibilities / Tasks:
- Build a multimodal pipeline combining vision, audio, and text modalities, including synchronization of frames and transcripts.
- Extract semantic embeddings from speech, captions, and visual frames and fuse them for downstream tasks.
- Implement scene segmentation, topic detection, sentiment analysis, and speaker intent analysis to produce rich annotations.
- Generate structured metadata outputs such as tags, concepts, categories, timestamps, and summary descriptors.
- Integrate metadata outputs into a search and recommendation engine (indexing, retrieval relevance, ranking signals).
- Optimize the system for speed (throughput and latency), relevance, and cross-language understanding (multilingual embeddings).
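
As a rough illustration of the synchronization, fusion, and structured-output tasks above, here is a minimal Python sketch. It is not a prescribed design: the names `TranscriptSegment`, `SceneMetadata`, `align_frames_to_transcript`, and `fuse_embeddings` are hypothetical, the fusion shown is a simple weighted average (one of many possible strategies), and it assumes every modality has already been projected into a shared embedding space of the same dimension.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field
from math import sqrt


@dataclass
class TranscriptSegment:
    """One speech-to-text segment (hypothetical shape, times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class SceneMetadata:
    """Structured output for one scene (hypothetical schema)."""
    start: float
    end: float
    tags: list[str] = field(default_factory=list)
    summary: str = ""


def align_frames_to_transcript(frame_times, segments):
    """Return, for each frame timestamp, the index of the transcript
    segment whose [start, end) interval contains it, or None."""
    aligned = []
    for t in frame_times:
        match = None
        for i, seg in enumerate(segments):
            if seg.start <= t < seg.end:
                match = i
                break
        aligned.append(match)
    return aligned


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_embeddings(embeddings, weights):
    """Late fusion: weighted average of L2-normalized per-modality vectors.

    Assumes all vectors share one dimension (an assumption of this sketch).
    """
    total = sum(weights[name] for name in embeddings)
    dim = len(next(iter(embeddings.values())))
    fused = [0.0] * dim
    for name, vec in embeddings.items():
        w = weights[name] / total
        for i, x in enumerate(l2_normalize(vec)):
            fused[i] += w * x
    return l2_normalize(fused)


# Usage with made-up data.
segments = [
    TranscriptSegment(0.0, 4.0, "welcome to the demo"),
    TranscriptSegment(4.0, 9.0, "here is the product"),
]
frame_alignment = align_frames_to_transcript([1.0, 3.5, 4.0, 10.0], segments)

fused = fuse_embeddings(
    {"vision": [3.0, 4.0], "text": [0.0, 2.0]},
    {"vision": 1.0, "text": 1.0},
)

scene = asdict(SceneMetadata(0.0, 9.0, ["demo", "product"], "Product intro"))
```

A weighted average is the simplest fusion to reason about and to tune per modality; learned fusion (for example, cross-attention) would replace `fuse_embeddings` without changing the alignment or output schema.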
Expected Outcome / Deliverables:
- A fully functional MVP capable of generating rich, meaningful metadata for arbitrary video inputs.
- Demonstrations and evaluation results showing scene segmentation quality, topic detection accuracy, embedding alignment across modalities, and end-to-end search/recommendation improvements.
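
One possible way to quantify two of the evaluation items above is sketched below in Python. The metric choices are assumptions, not requirements of the project: `boundary_f1` scores predicted scene boundaries against reference boundaries using a greedy one-to-one match within a time tolerance, and `recall_at_k` measures how many relevant videos appear in the top `k` search results. Both function names and the sample data are hypothetical.

```python
def boundary_f1(predicted, reference, tolerance=1.0):
    """Precision, recall, and F1 for scene boundaries (times in seconds).

    Each predicted boundary may match at most one reference boundary
    that lies within `tolerance` seconds (greedy matching, a simplification).
    """
    unmatched = sorted(reference)
    hits = 0
    for p in sorted(predicted):
        for r in unmatched:
            if abs(p - r) <= tolerance:
                unmatched.remove(r)  # safe: we break right after removing
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


# Usage with made-up data.
precision, recall, f1 = boundary_f1([4.2, 9.5, 15.5], [4.0, 9.0, 20.0])
search_recall = recall_at_k(["v3", "v1", "v7", "v2"], {"v1", "v2"}, k=2)
```

Comparing `recall_at_k` for a keyword-only baseline against the metadata-enriched index would give one concrete measure of end-to-end search improvement.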
Skills & Technologies Required:
- Strong background in AI/ML, deep learning, and multimodal architectures (vision + audio + text).
- Experience with NLP, speech-to-text systems, computer vision, embeddings, and video processing.
- Proficiency in Python and deep learning frameworks such as PyTorch or TensorFlow.
- Familiarity with practical concerns: model optimization, batching, GPU/CPU performance, and cross-language/multilingual models.
How to Apply:
- Send a project-specific application to the recruiting contact below, including your CV, a short motivation statement, and links to relevant projects or your portfolio.
- Email: career@olindias.com (use the subject specified below when applying).