🤖 AI Summary
The new Qwen3.5-35B model has been released, optimized for local deployment on consumer hardware. It reaches a peak of 125 tokens per second (t/s) while handling a context window of up to 120,000 tokens, and its multimodal vision features let it analyze images, PDFs, and screenshots at speed. The release matters to the AI/ML community because it shows that a high-performance language model can run efficiently on standard GPUs such as the NVIDIA RTX 30xx and 40xx series and the newly tested RTX 5080.
Technically, Qwen3.5 uses a hybrid architecture that interleaves Gated DeltaNet layers with standard Gated Attention layers, which holds VRAM usage to just 15.4 GB and leaves ample memory for the rest of the system. Users can configure it for different tasks, including coding and vision processing, via up to three ready-to-run profiles. A key finding from testing is the "context cliff": performance drops sharply beyond 156,000 tokens due to memory-bandwidth constraints, underscoring how much hardware limits matter when deploying large models in practice. Overall, Qwen3.5 is a notable step forward in local AI model performance.
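The hybrid design matters for long contexts because only the standard attention layers need a full per-token KV cache, while DeltaNet-style layers keep a fixed-size recurrent state. A back-of-envelope sketch of why that saves VRAM (all layer counts, head counts, and dimensions below are hypothetical illustrations, not published Qwen3.5 specs):

```python
def kv_cache_bytes(attn_layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes needed for the KV cache of the full-attention layers.

    Each attention layer stores K and V tensors of shape
    (kv_heads, seq_len, head_dim), hence the factor of 2.
    dtype_bytes=2 assumes fp16/bf16 cache entries.
    """
    return 2 * attn_layers * kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical configs at a 120,000-token context:
full_attn = kv_cache_bytes(attn_layers=48, kv_heads=8, head_dim=128,
                           seq_len=120_000)
hybrid = kv_cache_bytes(attn_layers=12, kv_heads=8, head_dim=128,
                        seq_len=120_000)  # only 12 of 48 layers use attention

print(f"all-attention KV cache: {full_attn / 2**30:.1f} GiB")
print(f"hybrid (12 attn layers): {hybrid / 2**30:.1f} GiB")
```

The cache grows linearly with sequence length, so even with the hybrid savings there is a context size where KV traffic saturates memory bandwidth, which is one plausible reading of the reported cliff.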