Synthetic Bootstrapped Pretraining (arxiv.org)

🤖 AI Summary
Synthetic Bootstrapped Pretraining (SBP) is a new LM pretraining procedure that first trains a synthesizer to model relations between documents in a corpus, then uses that synthesizer to generate a large synthetic corpus for joint training with the original data. Whereas standard pretraining captures only causal token correlations within individual documents, SBP explicitly models inter-document correlations and uses them to create novel, coherent documents that preserve latent concepts while recombining narrations and perspectives.

The authors validate SBP in a compute-matched setup by training a 3B-parameter model from scratch on up to 1 trillion tokens; SBP consistently outperforms a strong repetition baseline and attains a substantial portion of the gains an oracle with 20× more unique data would achieve. Technically, the synthesizer abstracts core concepts shared across related documents rather than merely paraphrasing them, yielding training samples that expand the model's conceptual coverage. The method also admits a Bayesian interpretation: the synthesizer approximates a posterior over latent concepts shared by related documents and then samples new document realizations from it.

Implications include improved sample efficiency, scalable data amplification when unique data is scarce, and a practical route to exploiting inter-document structure for better generalization. Practical caveats include the usual synthetic-data risks (distribution shift, amplified biases, hallucinations) and the need to balance synthesis quality against compute and verification costs.
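To make the pipeline concrete, here is a minimal structural sketch of the SBP loop described above: mine related document pairs, train a synthesizer on them, sample synthetic documents conditioned on corpus documents, and mix them with the real data for pretraining. This is an assumption-laden illustration, not the paper's implementation: the bag-of-words pairing heuristic, the similarity threshold, and the stubbed synthesizer are placeholders for what would in practice be a learned conditional LM approximating p(d_new | d_seed).

```python
# Structural sketch of the SBP data pipeline (illustrative only).
# Assumptions not taken from the paper: the pairing heuristic (bag-of-words
# cosine), the 0.3 threshold, and the stubbed train/sample calls.
import math
from collections import Counter
from itertools import combinations


def bow(doc: str) -> Counter:
    """Toy bag-of-words vector; stands in for whatever pairing signal the paper uses."""
    return Counter(doc.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mine_related_pairs(corpus: list[str], threshold: float = 0.3) -> list[tuple[str, str]]:
    """Step 1: find inter-document pairs to train the synthesizer on."""
    vecs = [bow(d) for d in corpus]
    return [(corpus[i], corpus[j])
            for i, j in combinations(range(len(corpus)), 2)
            if cosine(vecs[i], vecs[j]) >= threshold]


def train_synthesizer(pairs: list[tuple[str, str]]):
    """Step 2 (stub): fit a conditional LM p(d2 | d1) on related pairs.
    In practice this is a trained language model; here a trivial closure
    keeps the sketch runnable."""
    def synthesize(seed_doc: str) -> str:
        return "SYNTHETIC DOC conditioned on: " + seed_doc[:60]
    return synthesize


def build_synthetic_corpus(corpus: list[str], synthesize, samples_per_doc: int = 2) -> list[str]:
    """Step 3: amplify the corpus by sampling new documents from the synthesizer."""
    return [synthesize(d) for d in corpus for _ in range(samples_per_doc)]


if __name__ == "__main__":
    corpus = [
        "neural networks learn representations from data",
        "representations learned by neural networks transfer across tasks",
        "cooking pasta requires boiling salted water",
    ]
    pairs = mine_related_pairs(corpus)
    synthesize = train_synthesizer(pairs)
    synthetic = build_synthetic_corpus(corpus, synthesize)
    # Step 4: joint pretraining on real + synthetic tokens (compute-matched in the paper).
    training_mix = corpus + synthetic
    print(f"{len(pairs)} related pairs, {len(synthetic)} synthetic docs, "
          f"{len(training_mix)} total training docs")
```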