Omkar Thawakar

PhD Researcher | Multimodal AI | Video Understanding | LLMs & Agents

PhD researcher at MBZUAI, working on multimodal reasoning, video understanding, large multimodal models (LMMs), and self-evolving AI systems, with a strong focus on real-world deployment.

Highlights

[CVPR 2026] 3 papers accepted.
[ICLR-SLLM 2025] Spotlight (Top-2%) for MobiLLaMA.
[CVPR 2025] Highlight for All Languages Matter (LMM Evaluation).
[Impact] 300K+ HuggingFace downloads across models.
[Award] Khalifa Fund Entrepreneurship Competition Winner (250K AED).
[Award] Sandook Al Watan Entrepreneurship Competition Winner.
[Startup] Founder & Tech Lead @ Lawa.AI.

Spotlight Research

MobiLLaMA (ICLR 2025)

Accurate & Lightweight Fully Transparent GPT. 200K+ Downloads.

Read Paper

LlamaV-o1 (ACL 2025)

Rethinking Step-by-Step Visual Reasoning in LLMs.

Read Paper

Recent Projects

VisQ (Visual Query)

iOS application for composed image and video retrieval on iPhone

VisQ brings reason-aware visual retrieval to iPhone with an on-device Qwen3-VL-2B Core ML runtime. Users can search personal media with natural language, or run composed retrieval using a reference image plus an edit prompt, and then inspect "Why This Matched" explanations powered by the model's reasoning capability.

On-device AI | Composed Retrieval | Explainable Results | Privacy-First | Offline-First

Indexes local photos and videos directly from the iPhone photo library.
Supports text search and reference-image-guided retrieval with scene edits.
Surfaces human-readable match reasons and visual explanation chips.
Keeps embeddings, ranking, and inference on-device for privacy-preserving search.
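
As a rough illustration of the embedding-and-ranking idea above, here is a minimal Python sketch of composed retrieval over a local index. It is not VisQ's implementation: the cosine-similarity scoring, the weighted blend in `compose`, the `weight` parameter, and the toy 3-dimensional vectors and file names are all assumptions made for the example. In the app, embeddings would come from the on-device model rather than being written by hand.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def compose(reference, edit, weight=0.5):
    """Blend a reference-image embedding with an edit-prompt embedding.

    Hypothetical helper: a plain weighted average, chosen only for illustration.
    """
    return [(1 - weight) * r + weight * e for r, e in zip(reference, edit)]

def rank(query, index, top_k=3):
    """Return the top_k (media_id, score) pairs, best match first."""
    scored = [(cosine(query, emb), media_id) for media_id, emb in index.items()]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(media_id, score) for score, media_id in scored[:top_k]]

# Toy local index: media id -> embedding (made-up values).
index = {
    "beach.jpg": [1.0, 0.0, 0.0],
    "beach_sunset.mov": [0.7, 0.7, 0.0],
    "office.jpg": [0.0, 0.0, 1.0],
}

# Reference image resembling "beach.jpg", edit prompt pointing along the second axis.
query = compose([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
results = rank(query, index, top_k=2)
for media_id, score in results:
    print(f"{media_id}: {score:.3f}")
```

Because both the index and the query live in local memory, nothing in this sketch requires a network call, which is the property the privacy-first design relies on.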

App Store: Download Now | GitHub: Source code

Built from research

VisQ is based on our recent research, CoVR-R: Reason-Aware Composed Video Retrieval, and translates reason-aware composed retrieval into a practical iPhone app for local-first multimodal search.

Available now on the Apple App Store.

arXiv Research GitHub