🤖 AI Summary
Qwen3-Omni is a new natively end-to-end multilingual omni-modal foundation model family that ingests text, images, audio, and video and produces real-time streaming outputs as text and natural-sounding speech. The release emphasizes low-latency, interactive multimodal dialogue with natural turn-taking and strong audio/video capabilities: the team reports state-of-the-art results on 22 of 36 audio and audio-visual benchmarks and open-source SOTA on 32 of 36, with ASR, audio understanding, and voice-conversation performance comparable to Google's Gemini 2.5 Pro. It supports 119 text languages, 19 speech-input languages, and 10 speech-output languages, and ships three main 30B variants: Instruct, Thinking, and an open-source fine-tuned Captioner for detailed, low-hallucination audio captioning.
Key technical points: Qwen3-Omni uses a mixture-of-experts (MoE) Thinker–Talker architecture with AuT pretraining to learn strong general representations, plus a multi-codebook design to minimize speech-generation latency. Training combines early text-first pretraining with mixed multimodal training, preserving unimodal text and image quality while boosting audio/video skills. Practical notes include recommended inference paths (vLLM or the DashScope API for low latency, with a Docker runtime provided), FlashAttention 2 with fp16/bfloat16 for memory efficiency, and a qwen-omni-utils toolkit plus cookbooks covering ASR, translation, OCR, object grounding, audio-visual Q&A, agent-style audio function calls, and more, making it well suited for interactive multimodal agents and production deployment.
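To make the local Transformers path concrete, here is a minimal sketch of loading the model with FlashAttention 2 and bfloat16 and preparing one mixed audio+image turn with qwen-omni-utils. The class names `Qwen3OmniMoeForConditionalGeneration` and `Qwen3OmniMoeProcessor`, the `process_mm_info` helper, and the `(text_ids, speech)` generate output follow the pattern documented for the Qwen-Omni family and are assumptions here, not confirmed by this summary; check the model card for the exact identifiers.

```python
# Hedged sketch: class names and the (text_ids, speech) generate output are assumed
# from the Qwen-Omni family pattern; verify against the Qwen3-Omni model card.
import torch
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info  # toolkit mentioned in the release

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

# bfloat16 plus FlashAttention 2 keeps the 30B MoE within single-node GPU memory budgets.
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)

# One mixed audio + image user turn; the URLs are placeholders.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://example.com/question.wav"},
            {"type": "image", "image": "https://example.com/scene.jpg"},
            {"type": "text", "text": "Answer the spoken question about this image."},
        ],
    }
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True,
).to(model.device)

# The Thinker produces text tokens; the Talker returns a speech waveform alongside them.
text_ids, speech = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```

For serving, the low-latency recommendation above maps to a vLLM endpoint or the hosted DashScope API; the Transformers sketch is simply the most direct way to reproduce the cookbook examples locally.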