🤖 AI Summary
Large language model inference is often constrained by the Key-Value (KV) cache: storing past keys and values across long contexts consumes memory that grows linearly with context length and batch size. The paper “Expected Attention” proposes a training-free KV-cache compression technique that estimates each KV pair’s importance by computing the expected attention it would receive from future queries. Since future attention scores are unavailable and modern kernels such as FlashAttention never materialize the full attention matrix, the method instead leverages distributional properties of LLM activations to derive closed-form expected attention scores. Those scores are used to rank and prune KV pairs with minimal perturbation to the residual stream, enabling principled compression during both prefilling and autoregressive decoding.
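To make the idea concrete, here is a minimal sketch of scoring and pruning one attention head's cache, assuming future queries are modeled as a Gaussian whose moments are estimated from observed activations. The function names, tensor shapes, and the log-normal closed form used for the score are illustrative assumptions, not the paper's exact derivation.

```python
# Hypothetical sketch: the Gaussian-query approximation and helper names below
# are illustrative assumptions, not the paper's exact formulation.
import torch


def expected_attention_scores(keys: torch.Tensor,
                              query_mean: torch.Tensor,
                              query_cov: torch.Tensor) -> torch.Tensor:
    """Score each cached key by the attention it is expected to receive.

    Assumes future queries q ~ N(query_mean, query_cov). Then q @ k_i is
    Gaussian with mean m_i and variance v_i, so E[exp(q @ k_i / sqrt(d))]
    has the log-normal closed form exp(m_i / sqrt(d) + v_i / (2 d)), which
    serves here as an (unnormalized) importance score.

    keys:       (seq_len, head_dim) cached keys for one head
    query_mean: (head_dim,) empirical mean of observed queries
    query_cov:  (head_dim, head_dim) empirical covariance of observed queries
    """
    d = keys.shape[-1]
    m = keys @ query_mean                                    # (seq_len,)
    v = torch.einsum("sd,de,se->s", keys, query_cov, keys)   # k_i^T Sigma k_i
    return torch.exp(m / d**0.5 + v / (2 * d))


def prune_kv_cache(keys, values, scores, compression_ratio=0.5):
    """Keep the top (1 - compression_ratio) fraction of KV pairs by score."""
    n_keep = max(1, int(keys.shape[0] * (1 - compression_ratio)))
    idx = torch.topk(scores, n_keep).indices.sort().values   # preserve order
    return keys[idx], values[idx]


# Toy usage: random tensors stand in for one head's cache and observed queries.
torch.manual_seed(0)
seq_len, head_dim = 128, 64
keys, values = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
queries = torch.randn(256, head_dim)            # observed queries to fit the Gaussian
mu, cov = queries.mean(0), torch.cov(queries.T)
scores = expected_attention_scores(keys, mu, cov)
keys_c, values_c = prune_kv_cache(keys, values, scores, compression_ratio=0.5)
print(keys_c.shape, values_c.shape)             # torch.Size([64, 64]) each
```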
Practically, Expected Attention is notable because it requires no model retraining, works with modern attention implementations, and consistently outperforms existing KV-pruning baselines in the paper's experiments. By directly targeting the inference memory and throughput trade-off, it makes it easier to serve longer contexts and larger batches, or to run models on hardware with less memory, without significant quality loss; a back-of-the-envelope calculation is sketched below. The authors also released KVPress, a library with implementations and benchmarks for 20+ KV-compression methods, providing a reproducible platform for researchers and engineers to evaluate and adopt cache-compression strategies.
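The memory trade-off can be illustrated with a simple calculation of how KV-cache size scales with context length and how pruning half the pairs reduces it. This is a hypothetical helper, not part of KVPress, and the Llama-style configuration numbers are assumptions chosen for illustration.

```python
# Hypothetical back-of-the-envelope helper; the configuration defaults
# (32 layers, 8 KV heads, head_dim 128, fp16) are illustrative assumptions.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   dtype_bytes=2, batch_size=1):
    """Total bytes for keys + values across all layers (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len * batch_size


full = kv_cache_bytes(seq_len=128_000)
compressed = full * (1 - 0.5)   # 50% of KV pairs pruned
print(f"full: {full / 2**30:.1f} GiB, compressed: {compressed / 2**30:.1f} GiB")
# full: 15.6 GiB, compressed: 7.8 GiB
```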