The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text (arxiv.org)

🤖 AI Summary
Researchers released the Common Pile v0.1, an 8 TB corpus of public-domain and openly licensed text assembled from 30 diverse sources (research papers, code, books, encyclopedias, educational materials, audio transcripts, and more) and curated specifically for LLM pretraining. To validate the dataset, the team trained two 7-billion-parameter models, Comma v0.1-1T and Comma v0.1-2T, on 1 trillion and 2 trillion tokens respectively; both achieved performance competitive with the similarly sized Llama 1 and 2 7B models, which were trained on unlicensed corpora. Alongside the data release, the authors published the dataset-creation code, the training mixture, and model checkpoints. The work is significant because it shows that license-compliant corpora can reach the scale and quality needed to produce competitive LLMs, addressing legal and ethical concerns about training on unlicensed text. By open-sourcing the dataset, code, and checkpoints, the project improves reproducibility, auditability, and community-driven refinement, and provides a practical blueprint for license-aware pretraining. For practitioners, the Common Pile offers a ready-made, diverse pretraining corpus for experimenting with scale, fine-tuning, and safer model-release practices, while setting a baseline for future openly licensed datasets.
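As a rough orientation for practitioners, the sketch below shows one way such a release could be explored from the Hugging Face Hub using the datasets and transformers libraries. The repository IDs, subset name, and field names are assumptions for illustration only; they are not confirmed by the summary and may differ from the actual release.

```python
# Minimal sketch: stream a slice of an openly licensed corpus and run a
# small generation with one of the validation checkpoints.
# All repo IDs and field names below are assumed, not confirmed.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stream one source subset instead of downloading the full 8 TB.
ds = load_dataset(
    "common-pile/wikimedia",   # hypothetical subset repo ID
    split="train",
    streaming=True,
)
example = next(iter(ds))
print(example.get("text", example)[:500])  # "text" field is an assumption

# Load a validation model trained on the corpus (hypothetical repo ID).
tok = AutoTokenizer.from_pretrained("common-pile/comma-v0.1-1t")
model = AutoModelForCausalLM.from_pretrained("common-pile/comma-v0.1-1t")

prompt = "Openly licensed text can be used to"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```

Streaming keeps the footprint small for quick inspection; anyone planning full pretraining runs would instead follow the released dataset-creation code and training mixture.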