Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency (blog.google)

0 points 1 hour ago | visit original

🤖 AI Summary

Google has announced Quantization-Aware Training (QAT) checkpoints for its Gemma 4 model family, aimed at running the models on mobile devices and consumer laptops. The release follows the introduction of Multi-Token Prediction, which targeted inference speed, and provides checkpoints optimized for two targets: the widely used Q4_0 quantization format and a new mobile-specialized format. With this compression, models such as Gemma 4 E2B fit in under 1 GB of memory, small enough for everyday edge hardware, while maintaining high quality.

The key difference from Post-Training Quantization (PTQ) is that quantization is integrated into the training process, so the model learns to tolerate low-precision weights rather than having them imposed afterward. This mitigates the quality degradation typically associated with PTQ, and the smaller weights also speed up decoding and reduce VRAM requirements. Features such as static activations and channel-wise quantization are intended to make the models run more smoothly on mobile processors. The weights are available on platforms such as Hugging Face, and partnerships with developer tools are meant to make local deployment straightforward.
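To make two of the terms in the summary concrete, here is a minimal Python sketch of channel-wise quantization and of a Q4_0-style size estimate. It is an illustration under stated assumptions, not Google's implementation: the function names and the toy weight matrix are hypothetical, the 4-bit scheme is a plain symmetric one, and the size estimate assumes the usual ggml Q4_0 layout of 32 weights per block with one 16-bit scale. It does not perform QAT itself, which would apply this kind of rounding during training.

```python
def symmetric_scale(values, bits=4):
    """Scale that maps the largest magnitude onto the largest signed integer."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    return (max_abs / qmax) if max_abs > 0 else 1.0


def fake_quantize(values, scale, bits=4):
    """Round to signed integers, clamp, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def per_tensor(matrix, bits=4):
    """One scale shared by the whole matrix."""
    scale = symmetric_scale([v for row in matrix for v in row], bits)
    return [fake_quantize(row, scale, bits) for row in matrix]


def channel_wise(matrix, bits=4):
    """One scale per row (treated here as one output channel)."""
    return [fake_quantize(row, symmetric_scale(row, bits), bits) for row in matrix]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


def q4_0_bytes(n_weights, block_size=32, scale_bytes=2):
    """Size estimate: each block stores 4-bit values plus one 16-bit scale (assumed layout)."""
    blocks = -(-n_weights // block_size)    # ceiling division
    return blocks * (block_size // 2 + scale_bytes)


# Hypothetical weights: one channel with small values, one with large values.
weights = [[0.01, -0.02, 0.03, -0.04], [1.0, -2.0, 3.0, -4.0]]
pt = per_tensor(weights)
cw = channel_wise(weights)

print("small-channel error, per-tensor:  ", max_abs_error(weights[0], pt[0]))
print("small-channel error, channel-wise:", max_abs_error(weights[0], cw[0]))
print("bytes for 1,000,000 weights (Q4_0-style):", q4_0_bytes(1_000_000))
```

With a single shared scale, the large-valued channel dictates the step size and the small-valued channel collapses to zero; giving each channel its own scale avoids that. Under the assumed block layout, the storage cost works out to 18 bytes per 32 weights, or 4.5 bits per weight.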
