Qwen3-VL can scan two-hour videos and pinpoint nearly every detail (the-decoder.com)

🤖 AI Summary
Alibaba published a detailed technical report for Qwen3‑VL, its open multimodal family (dense 2B–32B plus MoE 30B‑A3B and a flagship 235B‑A22B), and showed the system can process extremely long visual contexts (≈262k‑token windows): enough to scan two‑hour videos or hundreds of PDF pages. In "needle‑in‑a‑haystack" tests, the 235B model found specific frames with 100% accuracy in 30‑minute videos and 99.5% in two‑hour clips (~1M tokens); a rough token budget for that regime is sketched below.

It leads many visual math and document benchmarks (MathVista 85.8%, MathVision 74.6%, DocVQA 96.5%, OCRBench 875 points) and supports 39 languages, though it trails GPT‑5 on some general reasoning and video QA tasks (e.g., MMMU‑Pro).

Three technical advances enable those gains, each sketched below: interleaved MRoPE, which distributes positional encoding across frequency bands to improve long‑video handling; DeepStack, which exposes intermediate vision‑encoder features to the language model; and a simple text‑timestamping scheme (e.g., "<3.8 seconds>") that replaces T‑RoPE.

Qwen3‑VL was trained in four phases on up to 10,000 GPUs over ~1 trillion tokens, gradually expanding context from 8k→32k→≈262k and adding chain‑of‑thought "Thinking" variants. Crucially, the weights are open under Apache‑2.0 on Hugging Face. The result is a high‑performing, research‑friendly multimodal specialist, especially for visual math, document understanding, and long‑context video analysis, that should accelerate open‑source multimodal work.
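To see why two‑hour videos land near the ~1M‑token figure the report cites, here is a back‑of‑the‑envelope sketch. The per‑frame token count and sampling rate below are illustrative assumptions chosen to be consistent with the reported numbers, not figures from the paper:

```python
# Rough token budget for long videos. TOKENS_PER_FRAME and FPS_SAMPLED
# are illustrative assumptions, not the model's actual settings.

CONTEXT_WINDOW = 262_144        # ≈262k-token window cited in the report
TOKENS_PER_FRAME = 140          # assumed visual tokens per sampled frame
FPS_SAMPLED = 1.0               # assumed sampling of 1 frame per second

def video_tokens(duration_s: float) -> int:
    """Rough token cost of a video at the assumed sampling settings."""
    frames = int(duration_s * FPS_SAMPLED)
    return frames * TOKENS_PER_FRAME

for minutes in (30, 120):
    tokens = video_tokens(minutes * 60)
    print(f"{minutes:>3} min ≈ {tokens:,} tokens "
          f"({tokens / CONTEXT_WINDOW:.2f}x the 262k window)")
```

Under these assumptions a 30‑minute video (~252k tokens) just fits the ≈262k window, while a two‑hour clip (~1M tokens) requires the extended regime used in the needle‑in‑a‑haystack test.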
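Interleaved MRoPE contrasts with the block‑partitioned multimodal RoPE of earlier Qwen‑VL models, where contiguous chunks of rotary frequency pairs were assigned to the time, height, and width axes. A minimal sketch of the difference, assuming a round‑robin assignment of axes to frequency pairs (dimension sizes and base are toy values, not the model's configuration):

```python
import numpy as np

HEAD_DIM = 12                   # toy head dim => 6 rotary frequency pairs
BASE = 10_000.0

def rope_angles(pos_t: int, pos_h: int, pos_w: int,
                interleaved: bool) -> np.ndarray:
    """Rotation angle per frequency pair for one (t, h, w) position."""
    n_pairs = HEAD_DIM // 2
    inv_freq = BASE ** (-np.arange(n_pairs) / n_pairs)
    axes = [pos_t, pos_h, pos_w]
    if interleaved:
        # Round-robin: pair i is driven by axis i % 3, so every axis
        # spans both high- and low-frequency bands.
        pos = np.array([axes[i % 3] for i in range(n_pairs)])
    else:
        # Block-partitioned: contiguous chunks of pairs per axis.
        chunk = n_pairs // 3
        pos = np.array([axes[min(i // chunk, 2)] for i in range(n_pairs)])
    return pos * inv_freq

print(rope_angles(pos_t=5, pos_h=2, pos_w=7, interleaved=True))
print(rope_angles(pos_t=5, pos_h=2, pos_w=7, interleaved=False))
```

The design intuition is that giving the temporal axis access to all frequency bands, rather than only one contiguous chunk, helps positions extrapolate over very long videos.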
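DeepStack's core idea is to feed the language model more than the vision encoder's final output: features tapped from intermediate ViT layers are projected and added to the hidden states at visual token positions in a few early decoder layers. A minimal sketch under assumed shapes, layer choices, and projections (all hypothetical, not the report's exact wiring):

```python
import torch
import torch.nn as nn

D_VIT, D_LLM, N_VIS = 1024, 2048, 16     # toy dims and visual token count

class DeepStackInjector(nn.Module):
    def __init__(self, n_levels: int):
        super().__init__()
        # One projection per tapped ViT level, mapping ViT dim -> LLM dim.
        self.proj = nn.ModuleList(nn.Linear(D_VIT, D_LLM)
                                  for _ in range(n_levels))

    def inject(self, hidden: torch.Tensor, vit_feats: torch.Tensor,
               level: int, vis_slice: slice) -> torch.Tensor:
        """Add projected level-`level` ViT features at visual positions."""
        hidden = hidden.clone()
        hidden[:, vis_slice] += self.proj[level](vit_feats)
        return hidden

# Toy usage: tap 3 ViT levels and inject one into each of the first three
# decoder layers at the visual token span [0:N_VIS).
injector = DeepStackInjector(n_levels=3)
hidden = torch.randn(1, 64, D_LLM)                 # decoder hidden states
vit_levels = [torch.randn(1, N_VIS, D_VIT) for _ in range(3)]
for layer_idx in range(3):
    hidden = injector.inject(hidden, vit_levels[layer_idx],
                             level=layer_idx, vis_slice=slice(0, N_VIS))
print(hidden.shape)   # torch.Size([1, 64, 2048])
```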
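The text‑timestamping scheme is simpler still: rather than encoding time into the positional scheme (T‑RoPE), each sampled frame is preceded by a plain‑text timestamp such as "<3.8 seconds>". A minimal sketch, with hypothetical token names:

```python
def interleave_timestamps(frame_tokens: list[list[str]],
                          fps_sampled: float) -> list[str]:
    """Prefix each frame's visual tokens with a textual timestamp."""
    sequence: list[str] = []
    for i, frame in enumerate(frame_tokens):
        t = i / fps_sampled
        sequence.append(f"<{t:.1f} seconds>")   # e.g. "<3.8 seconds>"
        sequence.extend(frame)
    return sequence

# Toy usage: three frames sampled at 1 fps, each reduced to 2 dummy tokens.
frames = [["<img_a>", "<img_b>"] for _ in range(3)]
print(interleave_timestamps(frames, fps_sampled=1.0))
```

Because the timestamps are ordinary text tokens, the model can ground answers in wall‑clock time without any special positional machinery.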