Show HN: Cleaned 2.7M French Wikipedia JSON articles (full dataset)" (huggingface.co)

🤖 AI Summary
A cleaned, normalized dump of the entire French Wikipedia — over 2.7 million articles — has been released as one JSON file per page, ready for NLP and ML use. The corpus strips wikitext and returns plain, structured content under a unified JSON schema (title, cleaned text, sections, infobox, categories, internal/external links, references, metadata). Distributed across 10 compressed archives (wiki_clean_block_00..09.tar.gz) with predictable filenames (article_0000001.json …), it is explicitly positioned for LLM pretraining/finetuning, embeddings, retrieval-augmented generation (RAG), semantic search, knowledge-graph extraction, and linguistic or academic research. License: CC BY‑SA 4.0 (same as Wikipedia).

Key technical notes: JSON records include fields like id, url, text, sections[], infobox{type, fields}, categories[], links_internal[], links_external[], references[] and metadata (source: frwiki, dump_date: 2025-11, version: 1.0). The author (Zeronex) cautions that some pages carry over missing metadata and structural variability from the original dump; cleaning removes markup but does not invent data. A dataset viewer hit a streaming error when previewing a split — a TypeError casting a nested struct (image, légende) — suggesting some nested fields (e.g., image captions) may need special handling with certain tools.

Overall, this is one of the largest readily usable open French corpora, but users should apply bias-aware curation and tooling that supports nested/complex types.
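A minimal sketch of how the per-article JSON files might be consumed directly from one of the ten blocks. The archive name and internal layout (JSON files reachable by iterating the tar members) are assumptions; the field names (title, text, categories, metadata) come from the schema described above.

import json
import tarfile

# Assumed: one of the 10 blocks named wiki_clean_block_00.tar.gz .. wiki_clean_block_09.tar.gz
ARCHIVE = "wiki_clean_block_00.tar.gz"

with tarfile.open(ARCHIVE, "r:gz") as tar:
    for member in tar:
        if not member.name.endswith(".json"):
            continue
        fh = tar.extractfile(member)
        if fh is None:
            continue
        article = json.load(fh)
        # Schema fields from the dataset description; some may be empty,
        # since cleaning removes markup but does not invent missing data.
        title = article.get("title")
        text = article.get("text", "")
        categories = article.get("categories", [])
        print(member.name, title, len(text), len(categories))
        break  # remove this to stream the whole block

Streaming straight from the tar.gz avoids unpacking 2.7 million small files onto disk; for tools that expect a fixed schema (such as the viewer that tripped on the nested image/légende struct), it may be easier to flatten or stringify the infobox field before loading.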