🤖 AI Summary
Synthetic Bootstrapped Pretraining (SBP) is a new LM pretraining procedure that first trains a synthesizer to model relations between documents in a corpus, then uses that synthesizer to generate a large synthetic corpus for joint training with the original data. Unlike standard pretraining, which focuses on causal token correlations within individual documents, SBP explicitly captures inter-document correlations and uses them to create novel, coherent documents that preserve latent concepts but recombine narrations and perspectives. The authors validate SBP in a compute-matched setup by training a 3B-parameter model from scratch on up to 1 trillion tokens; SBP consistently outperforms a strong repetition baseline and attains a substantial portion of the gains an oracle would get from having 20× more unique data.
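To make the recipe concrete, here is a minimal sketch of how the described steps could fit together. It is not the paper's implementation: the relatedness signal is assumed to be something like embedding nearest-neighbor retrieval, the synthesizer is replaced by a toy stand-in, and all names (`build_pairs`, `train_synthesizer`, `synthesize_corpus`) are illustrative.

```python
"""Hedged sketch of the SBP data pipeline summarized above (not the authors' code)."""
import random
from typing import Callable, List, Tuple


def build_pairs(corpus: List[str],
                retrieve: Callable[[str, List[str]], str]) -> List[Tuple[str, str]]:
    """Step 1: pair each seed document with a related document.
    `retrieve` stands in for the paper's relatedness signal (e.g. embedding
    nearest neighbors); its exact form here is an assumption."""
    return [(seed, retrieve(seed, corpus)) for seed in corpus]


def train_synthesizer(pairs: List[Tuple[str, str]]):
    """Step 2: fit a conditional LM p(related_doc | seed_doc) on the pairs.
    For this sketch the returned object only needs a .sample(seed) method."""
    class ToySynthesizer:
        def sample(self, seed: str) -> str:
            # Real SBP would decode from a trained LM conditioned on the seed;
            # we shuffle the seed's words purely so the pipeline runs end to end.
            words = seed.split()
            random.shuffle(words)
            return " ".join(words)
    return ToySynthesizer()


def synthesize_corpus(synth, corpus: List[str], samples_per_seed: int = 2) -> List[str]:
    """Step 3: amplify the corpus by sampling new documents from the synthesizer."""
    return [synth.sample(seed) for seed in corpus for _ in range(samples_per_seed)]


if __name__ == "__main__":
    corpus = ["transformers scale with data", "scaling laws predict loss"]
    retrieve = lambda seed, docs: random.choice([d for d in docs if d != seed])
    pairs = build_pairs(corpus, retrieve)
    synth = train_synthesizer(pairs)
    synthetic = synthesize_corpus(synth, corpus)
    # Step 4 (not shown): jointly pretrain on corpus + synthetic documents.
    print(len(synthetic), "synthetic documents")
```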
Technically, SBP’s synthesizer abstracts core concepts shared across related documents rather than merely paraphrasing them, yielding training samples that expand the model’s conceptual coverage. The method admits a Bayesian interpretation: the synthesizer approximates a posterior over the latent concepts underlying a document, then samples new document realizations from those concepts. Implications include improved sample efficiency, scalable data amplification when unique data is scarce, and a practical route to harnessing inter-document structure for better generalization. Practical caveats include the usual synthetic-data risks—distribution shift, amplified biases or hallucinations—and the need to balance synthesis quality against compute and verification costs.
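One way to write down that Bayesian reading, using our own notation rather than the paper's: with $d$ a seed document, $d'$ a synthesized document, and $c$ the shared latent concept, the synthesizer's conditional can be viewed as marginalizing over the concept.

```latex
% Hedged sketch of the Bayesian interpretation described above.
% Symbols d, d', c are illustrative notation, not necessarily the paper's.
p_{\text{synth}}(d' \mid d) \;\approx\; \int p(d' \mid c)\, p(c \mid d)\, \mathrm{d}c
```

Intuitively, $p(c \mid d)$ is the "abstraction" step (inferring the concept behind the seed) and $p(d' \mid c)$ is the "realization" step (writing a new document about it), which matches the summary's claim that SBP recombines narrations while preserving latent concepts.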