Still: Amortized KV Cache Compaction in a Single Forward Pass (arxiv.org)

0 points 1 hour ago | visit original

🤖 AI Summary

"Still" is a new method for easing the memory bottleneck of the key-value (KV) cache in long-horizon language model deployments. It trains a lightweight per-layer Perceiver against a frozen base model and uses it to compact the KV cache in a single forward pass. The approach balances computational efficiency against contextual fidelity, whereas existing methods either sacrifice speed or require extensive optimization tailored to specific contexts. Still operates at compression ratios from 8x to 200x with context lengths up to 128k tokens. On the long-context RULER grid it exceeds baselines by 8–22 points. The compacted cache also supports free-form summarization while retaining most of the benefit of the full context. Because compaction can be applied iteratively, Still makes memory-efficient models usable in long-context scenarios that were previously impractical.
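To put the quoted compression ratios in concrete terms, here is a minimal arithmetic sketch of how many cache entries would remain per layer after compaction. It is not code from the paper: the helper `compacted_length` is hypothetical, rounding up is an assumption, and "128k" is read as 131,072 tokens.

```python
import math


def compacted_length(context_tokens: int, ratio: float) -> int:
    """Return the number of KV entries left after compaction.

    Hypothetical helper: assumes the compacted cache holds
    ceil(context_tokens / ratio) entries.
    """
    if context_tokens < 0:
        raise ValueError("context_tokens must be non-negative")
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return math.ceil(context_tokens / ratio)


# Assumption: "128k tokens" is interpreted as 128 * 1024.
CONTEXT_TOKENS = 128 * 1024

if __name__ == "__main__":
    # The two ends of the compression range quoted in the summary.
    for ratio in (8, 200):
        remaining = compacted_length(CONTEXT_TOKENS, ratio)
        print(f"{ratio}x compaction: {CONTEXT_TOKENS} -> {remaining} entries")
```

Under these assumptions, the 200x end of the range leaves only a few hundred entries standing in for the whole context, which is why retained accuracy on RULER is the figure to watch.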
