🤖 AI Summary
Researchers show that when unique pretraining data is limited, diffusion language models (DLMs) can significantly outperform autoregressive (AR) models by repeating that data and training for more epochs, a phenomenon they call a "Crossover." In controlled experiments across dense and sparse architectures, larger models reach the crossover sooner, while more unique data or higher-quality data pushes it later. Key large-scale results include a 1.7B-parameter DLM, trained with ~1.5T tokens of compute on 10B unique Python tokens, that beats a matched AR coder, and a 1B DLM that reaches >56% on HellaSwag and >33% on MMLU from only 1B unique tokens by repeating standard data.
The authors attribute DLM advantages to three compounding factors: any-order (bidirectional) modeling, iterative denoising that yields "super-dense" compute (many effective updates per token), and built-in Monte Carlo augmentation from stochastic denoising trajectories. Simple noise injections improve AR training under data scarcity but don't close the gap. Practically, this implies different scaling behavior and evaluation needs: rising validation cross-entropy during repeated-data training need not predict worse downstream performance, and DLMs may be a better choice in data-constrained or compute-dense regimes (including code modeling), motivating new training and model-selection strategies for low-data scenarios.
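To make the augmentation point concrete, here is a minimal sketch of a masked-diffusion training step of the kind such models use (PyTorch; the `model` interface and `MASK_ID` are illustrative assumptions, not the paper's code). Because each pass samples a fresh masking ratio and a fresh mask, every repeated epoch presents a different corrupted view of the same sequence, which is one plausible reading of why repetition keeps yielding useful signal:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id for a reserved [MASK] token

def diffusion_lm_step(model, tokens):
    """One masked-diffusion training step (illustrative sketch).

    model: assumed to be a bidirectional Transformer mapping
           (batch, seq) token ids to (batch, seq, vocab) logits.
    tokens: (batch, seq) LongTensor of clean token ids.
    """
    b, n = tokens.shape
    # Sample a corruption level per sequence; repeats of the same data
    # are never identical inputs (the "Monte Carlo augmentation" effect).
    t = torch.rand(b, 1, device=tokens.device)
    mask = torch.rand(b, n, device=tokens.device) < t
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    # Bidirectional attention: each prediction conditions on context to
    # both sides, i.e. any-order rather than left-to-right modeling.
    logits = model(corrupted)
    # Supervise only the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```

Repeated denoising at many masking levels is also where the "super-dense" compute comes from: one sequence contributes many distinct prediction problems per epoch, rather than one fixed next-token target per position.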