🤖 AI Summary
Hugging Face has released FinePDFs, the largest public corpus built entirely from PDFs: 475 million documents in 1,733 languages, roughly 3 trillion tokens and 3.65 TB of data. Unlike typical web-derived corpora, PDFs concentrate higher-quality, domain-specific content (law, academia, technical writing) and often contain much longer documents, making them valuable for long-context LLM training. The dataset’s language spread is broad — English accounts for >1.1T tokens, Spanish/German/French/Russian/Japanese each exceed 100B, and 978 languages have more than 1M tokens. FinePDFs is available under the Open Data Commons Attribution license on the Hugging Face Hub and accessible via datasets, huggingface_hub and Datatrove.
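As a hedged sketch of the access path mentioned above, a language subset could be streamed with the `datasets` library rather than downloaded in full. The dataset id `HuggingFaceFW/finepdfs`, the per-language config naming scheme (e.g. `eng_Latn`), and the `text` field name are assumptions based on common Hugging Face Hub conventions, not details confirmed by this summary:

```python
def finepdfs_config(lang_code: str, script: str) -> str:
    """Build a FinePDFs per-language config name, e.g. 'eng_Latn'.

    The '<iso639-3>_<script>' scheme is an assumption based on common
    Hugging Face multilingual dataset conventions.
    """
    return f"{lang_code}_{script}"


if __name__ == "__main__":
    # `datasets` is an optional dependency; streaming avoids downloading
    # the full multi-terabyte corpus up front.
    from datasets import load_dataset

    ds = load_dataset(
        "HuggingFaceFW/finepdfs",        # assumed dataset id on the Hub
        finepdfs_config("eng", "Latn"),  # assumed English config name
        split="train",
        streaming=True,
    )
    for doc in ds:
        print(doc["text"][:200])  # "text" field name is an assumption
        break
```

Streaming mode iterates over records lazily, which matters here given the corpus's multi-terabyte scale.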
Technically, Hugging Face tackled the long-standing difficulty of extracting clean text from PDFs by combining a text-first extractor (Docling) with GPU-accelerated OCR (RolmOCR), plus deduplication, language identification, and PII anonymization, to process heterogeneous formats at scale. They evaluated FinePDFs by training 1.67B-parameter models on subsets: performance was near parity with SmolLM-3 Web (an HTML-derived benchmark), and mixing PDF and web data produced measurable gains, indicating the two sources carry complementary knowledge. Hugging Face documented the pipeline for transparency; they report benchmark results as the probability of choosing the correct answer rather than as a single aggregate score, a choice that prompted community discussion but underscores a move toward richer, source-diverse training data for research and long-context modeling.