AI and Wikipedia have sent vulnerable languages into a doom spiral (www.technologyreview.com)

🤖 AI Summary
Small Wikipedia editions for vulnerable languages are being flooded with poor machine-translated articles, creating a vicious feedback loop that damages both the encyclopedias and the AI systems that ingest them. Volunteers like Kenneth Wehr found entire Greenlandic pages full of grammatical nonsense and factual errors produced by machine translation; similar problems affect Inuktitut and several African languages, where volunteers estimate that 40–60% of pages are uncorrected machine translation, or where audits show over two-thirds of longer pages contain MT content. Tools such as Wikipedia's Content Translation make bulk page creation easy but rely on external MT engines and often produce subpar output: English Wikipedia reports that roughly 95% of Content Translation drafts fail to meet its standards without heavy editing.

The technical problem is simple but consequential. Many AI models learn minority languages almost exclusively from web text, and Wikipedia is often the largest easily accessible corpus for low-resource languages (the sole source for 27 languages in one scrape study). When those pages are garbage, models learn garbage, mistranslating nouns, dates, and culturally specific morphemes, especially in agglutinative languages like Greenlandic, where single words carry complex morphology.

The result is a "doom spiral": poor MT begets more poor pages, which corrupt training data, which further degrades MT quality. The consequences include eroded language integrity, fewer competent editors, and AI systems that misrepresent or marginalize endangered languages, underscoring an urgent need for community moderation, better tooling, and curated linguistic datasets.
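The call for curated datasets implies a filtering step before low-resource corpora are used for training. Below is a minimal, hypothetical sketch of such a pre-training filter; the signal choices (share of English function words, token repetition), the threshold, the tiny wordlist, and the names `mt_suspicion_score` and `filter_corpus` are all illustrative assumptions, not tooling described in the article.

```python
# Hypothetical pre-training filter: flag pages in a small-language
# corpus that look like unedited machine-translation debris.
# All thresholds and weights below are illustrative assumptions.
import re
from collections import Counter

# Stand-in for a real English lexicon; a production filter would
# load a large wordlist or use a proper language-ID model.
ENGLISH_WORDS = {"the", "of", "and", "in", "to", "is", "was", "for"}

def mt_suspicion_score(text: str) -> float:
    """Crude heuristic: share of English function words plus
    token-repetition rate. Higher = more likely low-quality MT."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # Unicode word tokens
    if not tokens:
        return 1.0  # an empty page is useless as training data anyway
    english_ratio = sum(t in ENGLISH_WORDS for t in tokens) / len(tokens)
    repetition = 1.0 - len(Counter(tokens)) / len(tokens)
    return 0.7 * english_ratio + 0.3 * repetition  # assumed weights

def filter_corpus(pages: list[str], threshold: float = 0.35) -> list[str]:
    """Keep only pages below the (assumed) suspicion threshold."""
    return [p for p in pages if mt_suspicion_score(p) < threshold]

if __name__ == "__main__":
    sample = [
        "Nuuk tassaavoq Kalaallit Nunaata illoqarfiata pingaarnersaa.",
        "the the of of in in the capital the of the island the",
    ]
    for page in sample:
        print(f"{mt_suspicion_score(page):.2f}  {page[:45]}")
```

Even a heuristic this crude separates the two samples above (scores of about 0.0 and 0.78), but real curation of the kind the article argues for would combine automated signals with community review rather than rely on thresholds alone.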