🤖 AI Summary
KVzip is a new query-agnostic KV-cache compression technique (NeurIPS 2025 oral) that shrinks long-context key/value caches while preserving the ability to answer diverse future queries. By combining head-level importance scoring with a context-reconstruction strategy, KVzip reports a 3–4× reduction in KV cache size and roughly a 2× cut in decoding latency with minimal quality degradation. The system supports both context-dependent compression (better compression ratios) and a faster context-independent mode (set load_score=True) that removes the score-computation overhead at the cost of a higher retained ratio (~0.6).
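A minimal sketch of the two modes, assuming a ModelKVzip wrapper exposing the prefill/prune/load_score names mentioned above; the import path, model identifier, and exact signatures are assumptions and may differ from the actual repository:

```python
# Sketch only: ModelKVzip, prefill, prune, and load_score are the names cited
# in the summary; the import path and signatures here are assumed.
from model import ModelKVzip

model = ModelKVzip("Qwen/Qwen2.5-7B-Instruct")
context = open("report.txt").read()

# Context-dependent mode: importance scores are computed by reconstructing the
# given context, allowing more aggressive eviction (keep ~30% of the cache).
kv = model.prefill(context, load_score=False)
kv.prune(ratio=0.3)

# Context-independent mode: reuse precomputed per-head scores, skipping the
# scoring overhead at prefill but keeping a larger share of the cache (~0.6).
kv_fast = model.prefill(context, load_score=True)
kv_fast.prune(ratio=0.6)
```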
Key technical highlights: KVzip adapts an AdaKV CUDA kernel to support non-uniform per-head budget allocation and integrates DuoAttention-style head-level pruning. Head-importance scores can be computed in a few forward passes in under a minute (≈100× faster than prior methods), and precomputed scores are provided for LLaMA3.1-8B and Qwen2.5-7B/14B. The implementation (CUDA 12.1, Python 3.10) exposes practical APIs and CLI tools to prefill the cache, prune it via a ratio parameter, retain or evict KV pairs, and toggle multi-turn cache updates (update_cache). NVIDIA's KVpress has added support, and a leaderboard tracks results; one caveat is that Gemma3 has no optimized kernel yet, so only reduced-attention evaluation is possible for that model. Overall, KVzip makes long-context inference substantially more memory- and latency-efficient, easing deployment of large LMs for long-document and multi-turn applications.
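A hypothetical multi-turn sketch of the workflow described above (prefill, prune via a ratio, generation with update_cache); the generate call and query handling are assumptions, not verified against the repository's exact API:

```python
# Hypothetical sketch: prefill, prune(ratio=...), and update_cache are named in
# the summary; model.generate and the query handling below are assumed and may
# differ from the actual repository API.
from model import ModelKVzip

model = ModelKVzip("meta-llama/Llama-3.1-8B-Instruct")
kv = model.prefill(open("long_document.txt").read(), load_score=False)
kv.prune(ratio=0.3)  # evict ~70% of the cached KV pairs

for q in ["Summarize the document.", "List the key dates mentioned."]:
    # update_cache=False answers each query against the same compressed cache;
    # set it to True to append the new turn's KV for multi-turn conversation.
    answer = model.generate(q, kv=kv, update_cache=False)
    print(q, "->", answer)
```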