Fast-DLLM: Training-Free Acceleration of Diffusion LLM (arxiv.org)

🤖 AI Summary
Fast-dLLM is a training-free approach that dramatically speeds up diffusion-based large language models by enabling KV caching and safer parallel decoding. The paper introduces a block-wise approximate KV cache tailored to bidirectional diffusion LLMs, so key/value states from previously decoded blocks can be reused across denoising steps with negligible accuracy loss. It also diagnoses why naive parallel decoding hurts quality: decoding many tokens at once implicitly treats them as conditionally independent, which breaks the dependency structure among tokens. To fix this, the authors propose a confidence-aware parallel decoding scheme that only finalizes tokens whose predicted confidence exceeds a threshold, leaving uncertain tokens masked for later refinement, which reduces dependency violations while retaining parallelism. The combined techniques close much of the runtime gap between diffusion and autoregressive LLMs without retraining: experiments on LLaDA and Dream show up to 27.6× throughput gains on standard benchmarks with minimal accuracy degradation. Because the methods are training-free and model-agnostic, they make diffusion LLMs far more practical for real-world inference, enabling high-throughput, non-autoregressive generation while preserving output quality, an important step toward deploying parallel-decoding architectures in production.
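
For intuition, here is a minimal sketch of what one confidence-aware parallel decoding step could look like in PyTorch. The function name, the `model(tokens)` interface, the `mask_id` convention, and the 0.9 threshold are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def confidence_aware_decode_step(model, tokens, mask_id, threshold=0.9):
    """One denoising step with confidence-aware parallel decoding.

    Hypothetical sketch: `model` is assumed to return per-position logits
    for a masked-token diffusion LLM; the names and threshold are
    illustrative, not taken from the paper.
    """
    logits = model(tokens)                      # (batch, seq_len, vocab)
    probs = torch.softmax(logits, dim=-1)
    confidence, prediction = probs.max(dim=-1)  # top-1 probability per position

    still_masked = tokens.eq(mask_id)
    # Commit only positions that are still masked AND whose confidence clears
    # the threshold; low-confidence positions stay masked and are revisited
    # in later denoising steps, limiting dependency violations.
    accept = still_masked & (confidence >= threshold)
    tokens = torch.where(accept, prediction, tokens)
    return tokens, accept
```

In a full generation loop this step would be repeated within each block until every masked position is committed, with the block-wise approximate KV cache supplying key/value states for blocks that have already been decoded.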