Ai2 Dolma: 3T token open corpus for language model pretraining (2023) (allenai.org)

🤖 AI Summary
The Allen Institute for AI released Dolma, a 3-trillion-token open pretraining corpus and the largest open dataset of its kind to date, on the Hugging Face Hub to support OLMo and broader reproducible research into large language models. The dataset was designed to be open, representative, and large enough to probe scaling laws (e.g., Chinchilla) and data-model tradeoffs, with an emphasis on reproducibility and harms-based risk mitigation. Ai2 also published an initial datasheet and the tooling used to build Dolma, with a fuller manuscript forthcoming.

Dolma aggregates web text, academic publications, books, encyclopedic material, and "just enough" code (sourced from The Stack), built primarily from 24 Common Crawl snapshots (2020-05 to 2023-06) plus C4 (April 2019). The corpus is English-only (fastText language ID with a permissive 0.5 threshold), uses CCNet for text extraction, applies Gopher-style paragraph filters (including removing paragraphs that do not end with punctuation), performs two-stage deduplication (URL-level dedup followed by intra-document paragraph dedup using Bloom filters), and combines source-specific and source-agnostic processing pipelines. Risk controls include PII masking (emails, phone numbers, IP addresses) using regexes and logistic classifiers, plus Jigsaw-based harmful-content filtering with a conservative >60% cutoff.

By matching common preprocessing practices while openly publishing both the data and the tools, Dolma enables independent model replication, investigation of how dataset choices affect capabilities and harms, and community scrutiny of large-scale pretraining data.
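
For illustration, a minimal sketch of what such an English-only filter looks like with fastText language ID, keeping documents whose English score clears the permissive 0.5 threshold described above. The lid.176.bin model name is the public fastText release and stands in for whatever model Dolma actually uses; this is not the project's own code.

```python
import fasttext

# Public fastText language-ID model; an assumption, not necessarily
# the model Dolma's pipeline uses.
model = fasttext.load_model("lid.176.bin")

def is_english(text: str, threshold: float = 0.5) -> bool:
    """Keep a document only if fastText scores it as English at or above
    the (permissive) threshold."""
    # fastText's predict() expects a single line of text.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and float(probs[0]) >= threshold
```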
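
The paragraph-level cleanup can likewise be pictured as a terminal-punctuation check plus a small Bloom filter for dedup. The filter size, hash count, punctuation set, and normalization below are illustrative assumptions rather than Dolma's settings; the released toolkit implements its own versions of these steps.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for membership testing of paragraphs."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> bool:
        """Insert item; return True if it was (probably) already present."""
        already_present = True
        for pos in self._positions(item):
            byte_index, bit_index = divmod(pos, 8)
            if not (self.bits[byte_index] >> bit_index) & 1:
                already_present = False
                self.bits[byte_index] |= 1 << bit_index
        return already_present

def clean_document(document: str) -> str:
    """Drop paragraphs that fail the punctuation heuristic, then drop
    repeated paragraphs, keeping the first occurrence."""
    seen = BloomFilter()
    kept = []
    for paragraph in document.split("\n"):
        stripped = paragraph.rstrip()
        # Gopher-style heuristic from the summary above: drop paragraphs
        # that do not end in terminal punctuation (punctuation set is illustrative).
        if not stripped or stripped[-1] not in '.!?"':
            continue
        # Bloom-filter dedup: skip a paragraph whose normalized text was seen before.
        if seen.add(stripped.lower()):
            continue
        kept.append(paragraph)
    return "\n".join(kept)
```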
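
The PII masking step can similarly be sketched with a few regexes for emails, phone numbers, and IP addresses. The patterns and replacement tokens here are assumptions for illustration only; Dolma's tooling defines its own rules and pairs regexes with trained classifiers.

```python
import re

# Order matters: mask IP addresses before the looser phone pattern,
# which would otherwise also match dotted runs of digits.
PII_PATTERNS = [
    ("|||EMAIL|||", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("|||IP|||", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("|||PHONE|||", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]

def mask_pii(text: str) -> str:
    """Replace email addresses, IP addresses, and phone numbers with mask tokens."""
    for token, pattern in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

# Example: all three PII types are replaced with their mask tokens.
print(mask_pii("Contact jane.doe@example.com or +1 (555) 867-5309 from 10.0.0.1"))
```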