🤖 AI Summary
The recent update to QVAC-Fabric, Tether's fork of llama.cpp, introduces full support for the Gemma 4 architecture. This matters for the AI/ML community because it extends QVAC-Fabric, which already offered advanced features such as memory-based model loading and on-device LoRA hot-swapping but previously lacked Gemma 4 compatibility. Benchmarks on an NVIDIA RTX 4090 report 132.5 tokens/second for prompt processing and 116.5 tokens/second for generation, with a model memory footprint of 1,416 MiB.
Key technical changes include distinct attention head dimensions for the sliding window attention (SWA) layers and per-layer KV caching to optimize memory use. Notably, QVAC-Fabric now differentiates attention heads between SWA and non-SWA layers, a significant change that improves the adaptability of models built on this architecture. The update also streamlines tensor loading and shape computation for better efficiency. Together these changes enable AI applications to run the Gemma 4 architecture while maintaining compatibility with existing systems through a straightforward patching process.
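The per-layer distinction can be made concrete with a minimal C++ sketch. Everything here is illustrative: the struct layout, the field names (`swa_period`, `n_head_swa`, and so on), and the layer pattern are assumptions for exposition, not the actual QVAC-Fabric or llama.cpp code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical per-layer hyperparameters; names are illustrative only.
struct layer_hparams {
    bool     is_swa;      // true for sliding-window-attention layers
    uint32_t n_head;      // attention heads for this layer
    uint32_t n_head_kv;   // key/value heads for this layer
    uint32_t n_embd_head; // dimension per attention head
};

struct model_hparams {
    uint32_t n_layer;
    uint32_t swa_period;  // assumed pattern: last layer of each period is full-attention
    uint32_t n_ctx;       // full context length
    uint32_t n_swa;       // sliding window size
    uint32_t n_head_swa,      n_head_full;
    uint32_t n_head_kv_swa,   n_head_kv_full;
    uint32_t n_embd_head_swa, n_embd_head_full;
};

// Choose head counts/dimensions per layer: SWA layers and full-attention
// layers may use different values, which is the distinction the patch adds.
static layer_hparams layer_config(const model_hparams & hp, uint32_t il) {
    const bool swa = (il % hp.swa_period) != (hp.swa_period - 1);
    if (swa) {
        return { true,  hp.n_head_swa,  hp.n_head_kv_swa,  hp.n_embd_head_swa  };
    }
    return     { false, hp.n_head_full, hp.n_head_kv_full, hp.n_embd_head_full };
}

// Per-layer KV-cache sizing: SWA layers only cache the window, while
// full-attention layers cache the whole context.
static size_t kv_cache_bytes(const model_hparams & hp, size_t bytes_per_elem) {
    size_t total = 0;
    for (uint32_t il = 0; il < hp.n_layer; ++il) {
        const layer_hparams lh = layer_config(hp, il);
        const uint32_t n_cached = lh.is_swa ? hp.n_swa : hp.n_ctx;
        // K and V each: n_cached tokens * n_head_kv heads * head dimension
        total += 2ull * n_cached * lh.n_head_kv * lh.n_embd_head * bytes_per_elem;
    }
    return total;
}
```

The key design point is in `kv_cache_bytes`: SWA layers reserve space for only `n_swa` tokens instead of the full `n_ctx`, which is what makes per-layer KV caching a memory optimization rather than a uniform allocation across all layers.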