REFRAG: Rethinking RAG Based Decoding (arxiv.org)

🤖 AI Summary
Researchers have introduced REFRAG, a decoding framework that substantially improves the efficiency of retrieval-augmented generation (RAG) for large language models (LLMs). Traditional RAG pipelines suffer from high latency and heavy memory use because concatenated retrieval passages produce very long context inputs, much of which contributes little to answering the actual query. REFRAG exploits the block-diagonal attention patterns that arise from the sparse, mutually diverse nature of retrieved content to skip unnecessary computation during decoding, sharply reducing processing time without sacrificing model accuracy.

Technically, REFRAG uses a three-step process (compress, sense, and expand) that exploits sparsity in the key-value cache, achieving a 30.85× speedup in time-to-first-token, more than three times faster than previous optimizations, while keeping perplexity on par with state-of-the-art models. The framework also extends the effective context length of LLMs by 16×, improving the handling of long documents, multi-turn conversations, and other tasks that rely on large retrieval contexts. This work is especially relevant for AI/ML practitioners optimizing retrieval-based systems, since it balances knowledge integration with computational efficiency, paving the way for more responsive and scalable RAG applications.
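The summary names the compress-sense-expand steps without showing what they buy. The PyTorch sketch below illustrates one plausible shape of the idea under stated assumptions: every name here (the chunk sizes, the mean-pool-plus-projection "encoder", the cosine-similarity "sense" policy) is an illustrative stand-in, not the paper's actual architecture or API.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a compress-sense-expand pipeline (assumptions, not the
# paper's design): retrieved passages arrive as chunks of token embeddings,
# most chunks are compressed to a single vector each, and only the chunks
# the "sense" step scores highest are expanded back to full token length.

d_model, chunk_len = 64, 16     # toy sizes, assumed for illustration
n_chunks, expand_k = 8, 2       # 8 retrieved chunks, expand the top 2

# Stand-ins for real inputs: query token embeddings and retrieved chunks.
query_tokens = torch.randn(10, d_model)             # (q_len, d)
chunks = torch.randn(n_chunks, chunk_len, d_model)  # (n, c_len, d)

# --- Compress: a lightweight encoder maps each chunk to one embedding.
# Mean-pooling plus a linear projection is a placeholder for whatever
# encoder the real system would train.
proj = torch.nn.Linear(d_model, d_model)
chunk_emb = proj(chunks.mean(dim=1))                # (n, d)

# --- Sense: score chunks for relevance to the query; cosine similarity
# to the mean query embedding stands in for a learned selection policy.
query_emb = query_tokens.mean(dim=0)
scores = F.cosine_similarity(chunk_emb, query_emb.unsqueeze(0), dim=-1)
expand_ids = set(scores.topk(expand_k).indices.tolist())

# --- Expand: selected chunks keep their full token embeddings; every
# other chunk contributes a single compressed position.
parts = [query_tokens]
for i in range(n_chunks):
    if i in expand_ids:
        parts.append(chunks[i])                      # full c_len positions
    else:
        parts.append(chunk_emb[i].unsqueeze(0))      # one compressed slot
decoder_input = torch.cat(parts, dim=0)

full_len = query_tokens.shape[0] + n_chunks * chunk_len
print(f"decoder input: {decoder_input.shape[0]} positions "
      f"vs {full_len} uncompressed")
```

The point of the sketch is the sequence-length arithmetic: attention cost and key-value-cache size scale with the decoder's input length, so replacing most chunks with one vector each (48 positions versus 138 in this toy run) is where time-to-first-token and memory savings of the kind the summary cites would come from.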