Nearest Neighbor Speculative Decoding for LLM Generation and Attribution (arxiv.org)

🤖 AI Summary
Researchers introduced Nearest Neighbor Speculative Decoding (NEST), a semi-parametric decoding method that reduces hallucinations and adds source attribution to LLM outputs by mixing parametric generation with token-level retrieval. Conventional kNN-LM also interpolates nearest-neighbor retrieval with the base model, but its token-by-token lookups make generation slow and often disfluent. NEST instead performs token-level retrieval at every decoding step to form a semi-parametric mixture distribution and to identify promising multi-token span continuations in the corpus. An approximate speculative decoding routine then either accepts a prefix of the retrieved span as the next tokens or falls back to generating from the base model, letting NEST incorporate real-world text spans of arbitrary length while tagging each with its provenance, and keeping inference efficient and fluent.

Empirically, NEST improves generation quality and attribution rates on knowledge-intensive tasks, outperforms conventional kNN-LM, and is competitive with in-context retrieval augmentation approaches. Crucially, it also speeds up inference, achieving roughly a 1.8× latency improvement when applied to Llama-2-Chat 70B, making grounded, attributable generation more practical for large models. Code will be released, positioning NEST as a promising tool for building faster, more trustworthy LLMs with explicit provenance.
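To make the mechanics concrete, here is a minimal, self-contained sketch of a NEST-style decoding loop as the summary describes it. Everything in it is an illustrative assumption rather than the paper's implementation: a toy vocabulary and corpus, random per-token vectors standing in for LM hidden states, a fixed interpolation weight `lam`, and a simple probability-threshold acceptance rule in place of the paper's exact accept/reject criterion. All names (`mixture`, `generate`, `tau`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumptions, not from the paper): a tiny vocabulary, random
# per-token vectors standing in for LM hidden states, and a seven-token
# "corpus" that doubles as the retrieval datastore.
VOCAB = ["<s>", "the", "cat", "sat", "on", "mat", ".", "dog"]
TOK2ID = {t: i for i, t in enumerate(VOCAB)}
EMB = rng.normal(size=(len(VOCAB), 16))

def embed(ctx):
    """Context 'embedding': mean of the last three token vectors."""
    return EMB[[TOK2ID[t] for t in ctx[-3:]]].mean(axis=0)

# Datastore: key i = embedding of the corpus prefix before position i,
# value = the token at position i (kept as a corpus index so both the
# continuation span and its provenance stay recoverable).
CORPUS = "the cat sat on the mat .".split()
KEYS = np.stack([embed(["<s>"] + CORPUS[:i]) for i in range(len(CORPUS))])

W_LM = rng.normal(size=(16, len(VOCAB)))  # stand-in "base model" head

def lm_probs(ctx):
    z = embed(ctx) @ W_LM
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture(ctx, lam=0.5, k=2):
    """Token-level retrieval at every step: interpolate the base-model
    distribution with a kNN distribution over datastore values, and
    return the nearest neighbor's corpus position as the span start."""
    d = np.linalg.norm(KEYS - embed(ctx), axis=1)
    nn = np.argsort(d)[:k]
    w = np.exp(-d[nn])
    w /= w.sum()
    p_knn = np.zeros(len(VOCAB))
    for i, wi in zip(nn, w):
        p_knn[TOK2ID[CORPUS[i]]] += wi
    return lam * p_knn + (1 - lam) * lm_probs(ctx), int(nn[0])

def generate(ctx, n_new=8, max_span=4, tau=0.2):
    """NEST-style loop: propose the span continuing the nearest neighbor,
    accept its prefix token by token while the mixture agrees (threshold
    rule), and fall back to the base model on rejection."""
    out, prov = list(ctx), []
    while len(out) - len(ctx) < n_new:
        _, src = mixture(out)
        span = CORPUS[src:src + max_span]  # candidate multi-token continuation
        accepted = 0
        for j, tok in enumerate(span):
            p_step, _ = mixture(out)
            if p_step[TOK2ID[tok]] < tau:
                break  # reject the rest of the span
            out.append(tok)
            prov.append((tok, f"corpus[{src + j}]"))  # provenance tag
            accepted += 1
        if accepted == 0:  # fallback: greedy token from the base model
            tok = VOCAB[int(np.argmax(lm_probs(out)))]
            out.append(tok)
            prov.append((tok, "model"))
    return out, prov

text, provenance = generate(["<s>", "the"])
print(" ".join(text))
for tok, src in provenance:
    print(f"{tok:>4}  <-  {src}")
```

The provenance list is what enables attribution here: every emitted token is tagged either with the corpus position it was copied from or as model-generated, and accepting multi-token spans in one step is what buys the latency improvement over per-token kNN-LM lookups.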