StreamingVLM: Real-Time Understanding for Infinite Video Streams (arxiv.org)

🤖 AI Summary
StreamingVLM is a new vision-language model architecture and training recipe for real-time, stable understanding of effectively infinite video streams. It addresses the core scalability problem of VLMs: full attention over long videos is quadratic in cost and impractical, while naive sliding windows either lose coherence or waste compute on redundant recomputation.

StreamingVLM aligns training with streaming inference. At run-time it keeps a compact KV cache by reusing the states of attention sinks, a short window of recent visual tokens, and a longer window of recent text tokens. During supervised fine-tuning (SFT), the model is trained with full attention on short, overlapped video chunks, so the learned behavior matches the low-latency attention pattern used at inference without requiring prohibitively long training contexts.

The authors also release Inf-Streams-Eval, a dense per-second benchmark with videos averaging over two hours, and report strong real-time results: StreamingVLM wins 66.18% of comparisons against GPT-4o mini and runs stably at up to 8 FPS on a single NVIDIA H100. The SFT strategy also boosts general VQA performance without VQA-specific tuning (LongVideoBench +4.30, OVOBench Realtime +5.96). Overall, StreamingVLM offers a practical, code-released approach to memory- and latency-efficient continuous video understanding, making it immediately relevant for real-time assistants, autonomous agents, and long-video analytics.
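
To make the KV-cache policy concrete, here is a minimal Python sketch of a bounded cache that retains attention-sink states, a short window of recent vision tokens, and a longer window of recent text tokens. The class name, field names, and the specific window sizes are illustrative assumptions, not the paper's released implementation.

```python
# Sketch of a StreamingVLM-style bounded KV cache: keep attention sinks,
# the most recent vision tokens, and a longer tail of recent text tokens.
# Window sizes below are placeholders, not values from the paper.
from dataclasses import dataclass, field


@dataclass
class Entry:
    pos: int   # position of the token in the stream
    kind: str  # "sink", "vision", or "text"


@dataclass
class StreamingKVCache:
    num_sinks: int = 4          # sink states reused for the whole stream
    vision_window: int = 512    # short window of recent visual tokens
    text_window: int = 2048     # longer window of recent text tokens
    entries: list = field(default_factory=list)

    def append(self, pos: int, kind: str) -> None:
        """Record a token's (K, V) slot, then evict anything outside the policy."""
        self.entries.append(Entry(pos, kind))
        self._evict()

    def _evict(self) -> None:
        # Keep the first num_sinks sink tokens and the most recent
        # vision/text tokens; drop everything else.
        sinks = [e for e in self.entries if e.kind == "sink"][: self.num_sinks]
        vision = [e for e in self.entries if e.kind == "vision"][-self.vision_window:]
        text = [e for e in self.entries if e.kind == "text"][-self.text_window:]
        keep = {id(e) for e in sinks + vision + text}
        self.entries = [e for e in self.entries if id(e) in keep]
```

Because eviction bounds the cache to num_sinks + vision_window + text_window entries, per-step attention cost stays constant no matter how long the stream runs, which is what enables stable throughput on a single GPU.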
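
The SFT side can be sketched similarly. The helper below builds overlapped frame ranges so each training example uses full attention over a short chunk; the chunk length and overlap here are illustrative choices, not the paper's hyperparameters.

```python
# Sketch of overlapped-chunk construction for SFT: full attention within
# each short chunk approximates the streaming attention pattern without
# requiring full-length training contexts. Defaults are placeholders.
def overlapped_chunks(num_frames: int, chunk_len: int = 64, overlap: int = 16):
    """Yield (start, end) frame ranges that tile the video with overlap."""
    stride = chunk_len - overlap
    start = 0
    while start < num_frames:
        yield (start, min(start + chunk_len, num_frames))
        if start + chunk_len >= num_frames:
            break
        start += stride


# Example: list(overlapped_chunks(200))
# -> [(0, 64), (48, 112), (96, 160), (144, 200)]
```

The overlap gives consecutive chunks shared context, which is what lets behavior learned with full attention on short windows transfer to the sliding-window cache used at inference.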