🤖 AI Summary
Researchers introduced ChunkLLM, a lightweight, pluggable framework that speeds up Transformer inference on long contexts by selecting and compressing relevant token “chunks” rather than attending to every token pair. ChunkLLM adds two small adapter modules — a QK Adapter (split into Q-Adapter and K-Adapter) attached to each Transformer layer for feature compression and chunk-attention estimation, and a Chunk Adapter at the bottom layer to detect chunk boundaries using contextual semantics. The backbone model remains frozen; only the adapters are trained using a novel attention-distillation loss that improves recall of key chunks. At inference, chunk selection is only triggered when a chunk boundary is detected, reducing attention computation and KV cache updates.
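To make the mechanism concrete, here is a minimal PyTorch-style sketch of the pieces described above: a per-layer QK Adapter that estimates chunk-level attention from compressed query/key features, a bottom-layer Chunk Adapter that flags chunk boundaries, and a top-k chunk selector. The module names mirror the summary, but the dimensions, pooling, boundary head, and threshold are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class QKAdapter(nn.Module):
    """Sketch of the per-layer QK Adapter: small Q-/K-Adapter projections that
    compress features so chunk-level attention can be estimated cheaply.
    (Bottleneck size and scoring form are assumptions.)"""

    def __init__(self, d_model: int, d_compressed: int = 64):
        super().__init__()
        self.q_adapter = nn.Linear(d_model, d_compressed, bias=False)
        self.k_adapter = nn.Linear(d_model, d_compressed, bias=False)

    def chunk_scores(self, q: torch.Tensor, chunk_keys: torch.Tensor) -> torch.Tensor:
        # q: (batch, d_model) hidden state of the current query token
        # chunk_keys: (batch, num_chunks, d_model) pooled key features per chunk
        q_c = self.q_adapter(q)                       # (batch, d_compressed)
        k_c = self.k_adapter(chunk_keys)              # (batch, num_chunks, d_compressed)
        return torch.einsum("bd,bnd->bn", q_c, k_c)   # chunk-level attention estimate


class ChunkAdapter(nn.Module):
    """Sketch of the bottom-layer Chunk Adapter: predicts from the token's
    contextual hidden state whether it closes a chunk."""

    def __init__(self, d_model: int):
        super().__init__()
        self.boundary_head = nn.Sequential(
            nn.Linear(d_model, d_model // 4), nn.GELU(), nn.Linear(d_model // 4, 1)
        )

    def is_boundary(self, hidden: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        # hidden: (batch, d_model) bottom-layer state of the current token
        return torch.sigmoid(self.boundary_head(hidden)).squeeze(-1) > threshold


def select_chunks(scores: torch.Tensor, top_k: int) -> torch.Tensor:
    """Keep only the top-k highest-scoring chunks; their KV-cache entries stay
    active for attention, the rest are skipped until the next boundary."""
    return scores.topk(min(top_k, scores.size(-1)), dim=-1).indices


# Toy usage: score 8 chunks for one query token and keep the top 3.
qk = QKAdapter(d_model=512)
scores = qk.chunk_scores(torch.randn(1, 512), torch.randn(1, 8, 512))
kept = select_chunks(scores, top_k=3)  # indices of chunks whose KV stays active
```

In the training setup the summary describes, only these adapter parameters would receive gradients (the backbone's weights stay frozen, e.g. via `p.requires_grad_(False)`), and chunk re-selection at inference would run only when `ChunkAdapter.is_boundary` fires, which is what cuts the attention and KV-cache work.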
The result is significant for scaling LLMs to very long inputs with minimal retraining: ChunkLLM matches the vanilla backbone on short-text benchmarks, preserves 98.64% of long-context performance while retaining only 48.58% of key-value cache entries, and achieves up to 4.48× speedup on 120K-token sequences versus a vanilla Transformer. Because it is pluggable and adapter-based, it can be deployed on existing models without full fine-tuning, trading a small accuracy drop (~1.36%) for large inference savings, which is useful for long-document QA, summarization, and memory-augmented LLM applications.