Show HN: Marlin-2B: a tiny VLM to extract structured information from videos (huggingface.co)

0 points | 2 days ago

🤖 AI Summary

Marlin-2B is a video vision-language model (VLM) that extracts structured information from videos and is now available to developers. At 2 billion parameters, it is described as the strongest open model in its class for dense captioning and natural-language temporal grounding, outperforming competitors like Gemini-2.5 at a fraction of the cost. It produces scene and event captions with second-level timestamps, answering two questions about video content: "what is happening?" and "when?". Marlin reports state-of-the-art results on benchmarks including DREAM-1K and TimeLens-Bench, narrowing the gap with larger models while staying light enough to run on a single consumer GPU. It is a fine-tune of Qwen3.5-2B and exposes a simple interface with methods for captioning and time-based querying, and it is compatible with standard Hugging Face transformers. The model was trained on a high-quality dataset aimed at improving the accuracy of its temporal information and the detail of its descriptions.
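To make the "what" and "when" questions concrete, here is a minimal Python sketch of how timestamped captions could be held and queried once a model has produced them. The summary does not show Marlin's actual output schema or method names, so the `Event` type, its fields, and both helper functions are assumptions for illustration only. The keyword match stands in for temporal grounding, which the model itself performs, and the sample data is hand-written rather than real model output.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One captioned span of video. Hypothetical fields, not Marlin's schema."""
    start_s: float  # span start, in seconds
    end_s: float    # span end, in seconds
    caption: str    # natural-language description of the span


def events_at(events: list[Event], t: float) -> list[Event]:
    """'What is happening at time t?': every event whose span contains t."""
    return [e for e in events if e.start_s <= t <= e.end_s]


def spans_matching(events: list[Event], keyword: str) -> list[tuple[float, float]]:
    """'When does X happen?': naive case-insensitive keyword match on captions."""
    needle = keyword.lower()
    return [(e.start_s, e.end_s) for e in events if needle in e.caption.lower()]


# Hand-written sample data, not real model output.
sample = [
    Event(0.0, 4.0, "A person opens a laptop"),
    Event(4.0, 9.0, "The person types on the keyboard"),
    Event(9.0, 12.0, "The person closes the laptop"),
]

if __name__ == "__main__":
    print(events_at(sample, 5.0))
    print(spans_matching(sample, "laptop"))
```

Keeping start and end times as plain seconds matches the second-level timestamps the summary mentions and makes both lookups a simple filter over the event list.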
