Beyond Next-Token Prediction: Autoregressive LMs (arxiv.org)

🤖 AI Summary
A new paper systematically compares autoregressive language models (ARMs) and diffusion language models (DLMs), using theoretical analysis and profiling to quantify where each approach wins and loses. ARMs—the dominant next-token prediction paradigm—suffer from low arithmetic intensity because of strict sequential conditioning, limiting hardware utilization for single long sequences. DLMs, which generate tokens in parallel, show higher arithmetic intensity and better per-token compute utilization, but they struggle to scale to long contexts and incur latency from many sampling steps. The authors benchmark trade-offs across context length, batch size, and decoding strategies to paint a clear performance landscape.

Key technical takeaways: block-wise decoding for DLMs can recover scalability to long contexts while maintaining improved arithmetic intensity, offering a hybrid that approaches ARM-like context scaling with better compute utilization. For batched inference, ARMs still deliver superior throughput by exploiting parallelism across multiple sequences in a batch. The paper highlights concrete engineering levers—reducing DLM sampling steps, optimizing block decoding, and targeting hardware-aware kernels—to make DLMs competitive in latency-sensitive, open-source deployments.

These results guide model and systems design choices (hybrid decoding, workload batching, and sampler improvements) for researchers and practitioners seeking better trade-offs between throughput, latency, and context scaling.
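The arithmetic-intensity gap can be made concrete with a back-of-envelope model. The sketch below (an illustrative assumption, not a calculation from the paper) counts FLOPs and bytes moved for a single fp16 linear layer: an ARM decode step multiplies one token's activation by the weight matrix (memory-bound, roughly 1 FLOP per byte), while a DLM denoising step applies the same weights to a whole block of tokens at once, amortizing the weight read. The function name and dimensions are hypothetical.

```python
# Back-of-envelope model (illustrative assumption, not from the paper):
# arithmetic intensity (FLOPs per byte moved) of one d x d linear layer
# in fp16, comparing one-token autoregressive decode vs. a parallel
# diffusion-style denoising step over a block of tokens.

def arithmetic_intensity(d: int, tokens: int, bytes_per_elem: int = 2) -> float:
    """FLOPs/byte for multiplying a (tokens x d) activation by a (d x d) weight."""
    flops = 2 * tokens * d * d                 # each multiply-accumulate = 2 FLOPs
    bytes_moved = bytes_per_elem * (
        d * d            # weight matrix, read once regardless of token count
        + tokens * d     # input activations
        + tokens * d     # output activations
    )
    return flops / bytes_moved

d = 4096                                          # hypothetical hidden size
ar_step = arithmetic_intensity(d, tokens=1)       # ARM: one new token per step
dlm_step = arithmetic_intensity(d, tokens=256)    # DLM: denoise a 256-token block

print(f"ARM  decode step: {ar_step:6.1f} FLOP/byte")   # ~1: memory-bound
print(f"DLM denoise step: {dlm_step:6.1f} FLOP/byte")  # orders of magnitude higher
```

Because the weight read dominates traffic at batch 1, the ARM step sits near 1 FLOP/byte while the block-parallel step scales roughly with the number of tokens denoised — the same reason the summary notes that batching across sequences restores ARM throughput: both amortize the weight read over more tokens.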