🤖 AI Summary
A new study shows that masked diffusion models beat autoregressive (AR) models when training is data‑constrained but compute is plentiful. The authors trained hundreds of models across wide ranges of model size, unique data, and epochs to disentangle the effects of compute and data and to fit scaling laws. Key empirical takeaways: at low compute AR wins, but beyond a predictable “critical compute” threshold (a power law the paper derives in closed form) diffusion keeps improving while AR plateaus or overfits; under the same limited data budget, diffusion reaches a better final validation loss (e.g., 3.51 vs. 3.71) and stronger downstream benchmark performance.
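To make the decision rule concrete, here is a minimal sketch of how such a critical‑compute threshold would be applied, assuming (as the scaling‑law framing suggests) that it is expressed as a power law in the number of unique training tokens U. The function names and the coefficients `a` and `b` are placeholders for illustration, not the paper's fitted values.

```python
# Hypothetical decision rule based on the "critical compute" threshold.
# C_crit(U) = a * U**b is the assumed power-law form; `a` and `b` below
# are placeholder values, NOT the coefficients fitted in the paper.

def critical_compute(unique_tokens: float, a: float = 1e-4, b: float = 2.0) -> float:
    """Compute budget (FLOPs) above which masked diffusion is predicted
    to beat AR when training on `unique_tokens` unique tokens."""
    return a * unique_tokens ** b

def pick_objective(compute_flops: float, unique_tokens: float) -> str:
    """Below the threshold AR wins; above it, diffusion keeps improving."""
    return "diffusion" if compute_flops > critical_compute(unique_tokens) else "AR"

# Example: a 100B-token unique dataset with a 1e24-FLOP budget.
print(pick_objective(1e24, 100e9))  # -> "diffusion" under these placeholders
```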
Technically, diffusion’s advantage stems from its training objective: random masking acts as implicit data augmentation, so the model sees many distinct conditional prediction tasks (different masking patterns and orderings) and tolerates extreme data repetition; a short sketch below illustrates this. The measured gap is striking: the data‑reuse half‑life R_D* is ~500 for diffusion versus ~15 for AR, and diffusion shows no overfitting even after 100 epochs over the same data. Prior reports that diffusion needs ~12–16× more compute than AR conflated compute with data; this work shows that diffusion instead extracts far more value from repeated data. Practical implication: if you’re compute‑rich but data‑poor, prefer diffusion (or hybrids that tune task diversity); if compute‑limited, stick with AR.
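For intuition on the masking‑as‑augmentation point, here is a minimal sketch (not the paper's code) of one masked‑diffusion corruption step; `MASK_ID` and the uniform mask‑rate schedule are assumptions for illustration. Each pass over the same sequence draws a fresh mask rate and pattern, so repeated epochs pose new prediction tasks, whereas AR's next‑token objective is identical on every repeat. That task diversity is one reading of the half‑life gap: the marginal value of another repeat decays slowly for diffusion (R_D* ≈ 500) and quickly for AR (R_D* ≈ 15).

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token in the vocabulary

def masked_diffusion_corrupt(tokens: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """tokens: (batch, seq_len) int64 token ids.
    Returns (corrupted, loss_mask): inputs with a random subset of positions
    replaced by MASK_ID, and the boolean mask of positions to score."""
    batch, seq_len = tokens.shape
    # Sample a per-sequence masking rate t ~ U(0, 1): the diffusion "time".
    t = torch.rand(batch, 1)
    # Mask each position independently with probability t (fresh every pass).
    loss_mask = torch.rand(batch, seq_len) < t
    corrupted = torch.where(loss_mask, torch.full_like(tokens, MASK_ID), tokens)
    return corrupted, loss_mask

# Cross-entropy is computed only on masked positions, so every epoch over
# the same data presents a different set of conditional prediction tasks.
```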