🤖 AI Summary
FinePDFs is a new large-scale dataset of approximately 3 trillion tokens extracted from 475 million PDF documents spanning 1,733 language-script pairs. PDFs have historically been avoided as a training source because of their extraction complexity and cost; by tapping this underused trove, FinePDFs offers a rich, diverse text corpus rivaling state-of-the-art HTML-based datasets like SmolLM-3. Importantly, mixing FinePDFs with traditional web-crawled data yields notable performance gains across language benchmarks, demonstrating its value for training robust multilingual and multi-script language models.
Technically, FinePDFs was built with a two-tier extraction pipeline: a fast text-based extractor for digitally born PDFs and a resource-intensive GPU-based OCR pipeline for scanned documents, with an XGBoost classifier selecting the appropriate path for each file. The dataset underwent extensive cleaning, including exact and MinHash-based deduplication, language identification, PII anonymization, boilerplate removal, and error filtering. It also captures complex code-switching documents, common in legal and academic PDFs, adding challenging real-world diversity. Released under an open license with full reproducibility and processing code via the datatrove library, FinePDFs sets a new standard for large-scale, high-quality language corpora drawn from a previously underused document format, promising to enrich future AI/ML models with richer, more varied linguistic context.
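To make the MinHash deduplication step concrete, here is a minimal, self-contained sketch of the underlying idea: each document is reduced to a short signature of minimum hash values over its character shingles, and the fraction of matching signature slots estimates Jaccard similarity between documents. This is an illustrative toy implementation, not the FinePDFs/datatrove code; the shingle size `k=5` and `num_perm=64` are arbitrary choices for the example.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    """Character k-grams of a document (k=5 is an arbitrary choice here)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(text: str, num_perm: int = 64) -> list:
    """MinHash signature: for each of num_perm seeded hash functions,
    keep the minimum hash value over the document's shingles."""
    sig = []
    for seed in range(num_perm):
        # A different blake2b salt per slot simulates independent hash functions.
        salt = seed.to_bytes(16, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Near-duplicate documents share most shingles, so their signatures agree often;
# unrelated documents rarely collide, which is how a dedup pass can flag pairs
# above a similarity threshold without comparing full texts.
a = "The quick brown fox jumps over the lazy dog near the river bank today."
b = "The quick brown fox jumps over the lazy dog near the river bank now."
c = "Completely unrelated sentence about deep learning model training runs."
sig_a, sig_b, sig_c = map(minhash_signature, (a, b, c))
print(estimated_jaccard(sig_a, sig_b) > estimated_jaccard(sig_a, sig_c))
```

At production scale, systems like datatrove pair this signature step with locality-sensitive hashing so that only candidate pairs with matching signature bands are ever compared.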