🤖 AI Summary
TiDAR introduces a sequence-level hybrid language model that "Thinks" with diffusion-style parallel drafting and then "Talks" by sampling autoregressively — all inside a single forward pass using specially designed structured attention masks. This architecture exploits high GPU compute density to run a parallel draft-and-verify pipeline while preserving causal AR sampling semantics via exact KV cache support. Unlike prior approaches that either use a weaker AR drafter (speculative decoding) or force AR-like left-to-right behavior onto diffusion (losing parallelism and quality), TiDAR keeps diffusion’s throughput advantages without sacrificing final-generation quality.
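The structured attention mask described above can be illustrated with a minimal sketch. This is a hypothetical layout, not TiDAR's exact design: accepted prefix tokens attend causally among themselves (preserving AR semantics and KV-cache compatibility), while a block of draft tokens attends to the full prefix and bidirectionally within its own window, mimicking diffusion-style parallel drafting — all expressible as a single mask for one forward pass.

```python
import numpy as np

def hybrid_attention_mask(n_prefix: int, n_draft: int) -> np.ndarray:
    """Illustrative hybrid mask (hypothetical layout, not the paper's exact design).

    - Prefix tokens (already accepted) attend causally among themselves.
    - Draft tokens attend to the entire prefix and bidirectionally to each
      other, emulating a diffusion-style parallel drafting block.

    Returns a boolean matrix where mask[i, j] = True means token i may
    attend to token j.
    """
    n = n_prefix + n_draft
    mask = np.zeros((n, n), dtype=bool)
    # Causal (lower-triangular) attention over the accepted prefix.
    mask[:n_prefix, :n_prefix] = np.tril(np.ones((n_prefix, n_prefix), dtype=bool))
    # Draft tokens see the whole accepted prefix.
    mask[n_prefix:, :n_prefix] = True
    # Draft tokens attend bidirectionally within the drafting window.
    mask[n_prefix:, n_prefix:] = True
    return mask

mask = hybrid_attention_mask(n_prefix=4, n_draft=3)
```

Because the causal and bidirectional regions live in one mask, drafting and verification can share a single forward pass rather than alternating between two models.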
Evaluated at 1.5B and 8B parameter scales against autoregressive models, speculative decoding, and diffusion variants (e.g., Dream, LLaDA), TiDAR delivers both higher measured throughput and improved sample quality — closing the quality gap with AR models while producing 4.71x–5.91x more tokens per second. For the AI/ML community this suggests a practical path to combine diffusion parallelism and autoregressive fidelity in a serving-friendly model (low overhead, exact KV caching), enabling higher-throughput, lower-latency text generation for large-scale deployments and prompting further research into hybrid attention patterns and sequence-level diffusion schemes.
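The draft-and-verify loop behind those speedups can be sketched as follows. This is a simplified greedy acceptance rule under stated assumptions, not TiDAR's exact sampling scheme: drafted tokens are accepted left to right while each matches the verifier's argmax, and the first mismatch is replaced by the verifier's own token, so every pass emits at least one correct token.

```python
import numpy as np

def greedy_verify(draft_tokens, verifier_logits):
    """Illustrative greedy draft-and-verify step (a sketch, not TiDAR's rule).

    draft_tokens:    token ids proposed in parallel by the drafter.
    verifier_logits: one logit row per draft position, produced by the AR
                     verifier in the same forward pass as the drafts.

    Accept drafts left to right while each matches the verifier's argmax;
    at the first mismatch, keep the verifier's token and discard the rest.
    The result is always non-empty, so generation advances every pass.
    """
    accepted = []
    for tok, logits in zip(draft_tokens, verifier_logits):
        best = int(np.argmax(logits))
        accepted.append(best)
        if best != tok:
            break  # mismatch: remaining drafts are no longer valid
    return accepted
```

When most drafts are accepted, one forward pass yields several tokens instead of one, which is where the reported multi-x throughput gains come from.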