🤖 AI Summary
The paper studies pretraining when compute is effectively unlimited but fresh web text is scarce, asking how to maximize performance under a fixed data budget. The authors show that common “more epochs + bigger models” recipes overfit under data constraints and that much stronger regularization is crucial: the optimal weight decay is about 30× larger than typical defaults. With this tuned, regularized recipe, validation loss falls monotonically with parameter count following a power law, so they estimate ultimate performance from the scaling-law asymptote rather than from performance at a fixed compute budget.
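A minimal sketch of that asymptotic estimate, assuming the loss follows a saturating power law in parameter count, L(N) = E + A·N^(−α), and extracting the asymptote E from a curve fit. The data points, parameter values, and function names below are illustrative placeholders, not the paper's results.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_params(n_params_m, E, A, alpha):
    # Saturating power law: loss approaches the asymptote E as parameters grow.
    return E + A * n_params_m ** (-alpha)

# Hypothetical (parameters in millions, validation loss) pairs for a tuned recipe.
n_params_m = np.array([10, 30, 100, 300, 1000], dtype=float)
val_loss = np.array([4.37, 3.83, 3.50, 3.33, 3.23])

# Fit E, A, alpha; E is the estimated best achievable loss at this data budget.
(E, A, alpha), _ = curve_fit(loss_vs_params, n_params_m, val_loss, p0=[3.0, 3.0, 0.5])
print(f"Estimated loss asymptote E = {E:.2f} (limit as parameter count grows)")
```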
They further show that ensembling independently trained models lowers the loss asymptote further, and that combining epoching, heavy regularization, parameter scaling, and ensembling reaches an asymptote at a 200M-token budget while using 5.17× less data than their baseline. Crucially, the ensemble can be distilled into a student 8× smaller that preserves ~83% of the ensemble's benefit. The interventions also transfer to downstream tasks: ~9% improvement on pretraining evals and a 17.5× data-efficiency gain over naïve continued pretraining on math data. The work implies that in a compute-rich, data-limited future, careful regularization, ensemble-then-distill pipelines, and asymptotic scaling-law analysis can yield large, cost-effective gains in pretraining data efficiency.
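A minimal sketch of one way to distill an ensemble into a smaller student, assuming the standard soft-label objective of matching the student to the teachers' averaged output distribution. The `distill_step` helper, its arguments, and the temperature knob are hypothetical illustrations, not the paper's actual training recipe.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teachers, tokens, optimizer, temperature=1.0):
    """One distillation update against the averaged teacher distribution."""
    with torch.no_grad():
        # Average the teachers' next-token probabilities: the ensemble target.
        teacher_probs = torch.stack(
            [F.softmax(t(tokens) / temperature, dim=-1) for t in teachers]
        ).mean(dim=0)
    student_log_probs = F.log_softmax(student(tokens) / temperature, dim=-1)
    # KL(teacher || student), summed over the vocabulary, normalized by batch size.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The student sees only the teachers' soft targets, so a much smaller model can absorb most of the ensemble's improvement at a fraction of its inference cost.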