🤖 AI Summary
A new technical report, "Thinking with Visual Primitives," tackles a persistent weakness of Multimodal Large Language Models (MLLMs): the "Reference Gap." Because natural language is ambiguous about which part of an image is being discussed, MLLMs struggle with complex structural reasoning. The proposed model interleaves spatial markers (such as points and bounding boxes) directly into its reasoning traces, anchoring abstract linguistic concepts to concrete physical coordinates. This mirrors how humans point at and outline regions while reasoning, and it improves accuracy on complex visual tasks.
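The report does not specify the marker syntax, but the idea of a reasoning trace grounded in coordinates can be sketched as follows. The `<point>`/`<box>` tag format, normalized coordinates, and helper names here are illustrative assumptions, not the paper's actual scheme:

```python
# Hypothetical sketch of a grounded reasoning trace: text interleaved with
# spatial markers so each referenced object is tied to image coordinates.
# Marker syntax and coordinate convention are assumptions for illustration.

def point(x: float, y: float) -> str:
    """Serialize a point marker in normalized [0, 1] image coordinates."""
    return f"<point>{x:.3f},{y:.3f}</point>"

def box(x1: float, y1: float, x2: float, y2: float) -> str:
    """Serialize a bounding-box marker (top-left and bottom-right corners)."""
    return f"<box>{x1:.3f},{y1:.3f},{x2:.3f},{y2:.3f}</box>"

# A counting step that enumerates grounded references instead of relying on
# ambiguous phrases like "the one on the left".
apples = [(0.10, 0.40, 0.25, 0.55), (0.60, 0.42, 0.74, 0.57)]
trace = "I count " + ", then ".join(box(*b) for b in apples)
trace += f"; that is {len(apples)} apples."
```

Because every reference in `trace` carries explicit coordinates, a downstream verifier (or the model itself) can check each claim against the image rather than resolving vague textual descriptions.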
Notably, the framework builds on the DeepSeek-V4-Flash architecture and improves visual token efficiency by compressing every four visual tokens into a single entry. Despite its compact size and reduced image-token budget, the model performs comparably to leading models such as GPT-5.4 and Claude-Sonnet-4.6 on counting and spatial-reasoning benchmarks. The team plans to release its in-house benchmarks, some cold-start data, and the model weights, giving the AI/ML community a concrete resource for studying grounded visual reasoning.
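The report states only that four visual tokens are merged into one entry, not how. A minimal sketch, assuming a simple mean-pooling scheme over consecutive patch tokens (the grouping and pooling choice are assumptions, not the paper's mechanism):

```python
# Illustrative 4-to-1 visual token compression via mean-pooling.
# The actual model may use a learned projection or attention-based merging;
# this only shows how the image-token budget shrinks by 4x.
import numpy as np

def compress_visual_tokens(tokens: np.ndarray, group: int = 4) -> np.ndarray:
    """Merge every `group` consecutive visual tokens into one entry.

    tokens: (num_tokens, dim) array; num_tokens must divide evenly by `group`.
    Returns an array of shape (num_tokens // group, dim).
    """
    n, d = tokens.shape
    assert n % group == 0, "token count must be divisible by the group size"
    return tokens.reshape(n // group, group, d).mean(axis=1)

# e.g. 576 patch tokens (a 24x24 grid) become 144 entries, so the LLM
# attends over a quarter of the original visual sequence length.
patches = np.random.rand(576, 64)
compressed = compress_visual_tokens(patches)
```

Whatever the real merging operator is, the payoff is the same: fewer visual entries per image means a smaller attention footprint, which is how a compact model can afford high-resolution inputs.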