🤖 AI Summary
Pleias today released SYNTH, a fully synthetic, reasoning-focused pretraining corpus built from a "memory core" of 50,000 Wikipedia vital articles expanded into diverse, grounded tasks (math, RAG, editing, creative constraints, multi-turn dialogue). Rather than relying on massive web crawls, SYNTH stitches together multi-step pipelines of fine-tuned models that use seed attribution, embedding retrieval, randomized constraints, and verification (LLM-as-judge or formal checks) to produce traceable reasoning paths. About 20% of the data is multilingual (major European languages); code was intentionally excluded. Seed and model attribution plus CC-BY-SA licensing aim to make the dataset legally reusable.
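To make that pipeline concrete, here is a minimal Python sketch of how a single seed article might flow through one generation step. Every name in it (`generate_sample`, `index.retrieve`, `judge.score`, and so on) is a hypothetical illustration of the described stages, not the actual SYNTH tooling.

```python
# A minimal sketch (hypothetical names, not the actual SYNTH code) of expanding
# one Wikipedia vital-article seed into a grounded, verified training sample.
import random

TASKS = ["math", "rag_qa", "editing", "creative_constraints", "dialogue"]
LANGS = ["en", "fr", "de", "es", "it"]  # roughly 20% of samples non-English

def generate_sample(seed, index, generator, judge, formal_check):
    """Expand one seed article into a traceable reasoning sample, or None if rejected."""
    # Embedding retrieval grounds the task in related passages.
    context = index.retrieve(seed.text, k=4)

    # Randomized constraints diversify task type, language, and length.
    constraints = {
        "task": random.choice(TASKS),
        "language": random.choices(LANGS, weights=[80, 5, 5, 5, 5])[0],
        "max_words": random.choice([150, 300, 600]),
    }

    # A small fine-tuned generator writes the prompt, reasoning trace, and answer.
    sample = generator.run(seed=seed, context=context, **constraints)

    # Verification: a formal check where one exists (e.g. math), LLM-as-judge otherwise.
    ok = formal_check(sample) if constraints["task"] == "math" else judge.score(sample) >= 0.8

    # Seed attribution keeps every sample traceable and CC-BY-SA compliant.
    return {**sample, "seed": seed.title, "license": "CC-BY-SA"} if ok else None
```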
Technically significant result: two small but very deep "reasoners" trained solely on SYNTH achieved strong outcomes with far less data and compute than conventional pretraining. Monad (56M params, 64 layers) and Baguettotron (80 layers) were trained on under 200B tokens, roughly 10–50× less than comparable models, and the final runs used under 1,000 H100 hours (about 20,000 H100 hours for the project as a whole). Baguettotron reports state-of-the-art results for its size on MMLU, GSM8K, and HotpotQA; Monad shows clearly non-random performance as a minimal viable reasoner. SYNTH demonstrates that engineered, grounded synthetic traces plus extreme depth enable early emergence of reasoning skills, shifting the emphasis from scale of data to quality of context and offering a practical path to efficient, auditable small reasoners and agentic training pipelines.
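As a rough sanity check on how 64 to 80 layers can fit into only tens of millions of parameters, the back-of-the-envelope calculation below estimates transformer parameter counts. The hidden size, FFN ratio, and vocabulary size are assumptions chosen for illustration, not the published Monad or Baguettotron configurations.

```python
# Back-of-the-envelope parameter count for a depth-heavy transformer.
# Hidden size, FFN multiplier, and vocab size are illustrative assumptions,
# not the actual Monad/Baguettotron configs.
def transformer_params(layers, d_model, ffn_mult=4, vocab=8192):
    attention = 4 * d_model * d_model          # Q, K, V, and output projections
    ffn = 2 * d_model * (ffn_mult * d_model)   # up- and down-projections
    embeddings = vocab * d_model               # tied input/output embeddings
    return layers * (attention + ffn) + embeddings

# A 64-layer model stays near ~50M parameters only if it is very narrow:
print(f"{transformer_params(64, 256) / 1e6:.1f}M params")   # -> 52.4M params
# The same depth at a more typical width is an order of magnitude larger:
print(f"{transformer_params(64, 768) / 1e6:.1f}M params")   # -> 459.3M params
```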