🤖 AI Summary
Researchers from Texas A&M, the University of Texas, and Purdue published a preprint proposing an "LLM brain rot hypothesis": continual pre-training on low-quality, engagement-driven web text can induce lasting cognitive decline in language models, analogous to the attention and memory problems seen in humans who consume trivial online content. To test this, they extracted "junk" and control subsets from a 100M-tweet HuggingFace corpus using two metrics. The first flagged tweets with high engagement and short length, on the assumption that popular, short posts are trivial; the second used a GPT-4o prompt to surface tweets with low "semantic quality", i.e. sensationalized, conspiratorial, or superficially attention-seeking content. On a random sample, the GPT-4o classifications agreed with three graduate-student raters 76% of the time.
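As a concrete illustration of how such a filtering pipeline might look, here is a minimal Python sketch of the two junk metrics. The field names (likes, retweets, replies, text), the engagement and length thresholds, and the classification prompt are all illustrative assumptions; the preprint's exact cutoffs and prompt are not reproduced here. The GPT-4o call uses the standard OpenAI chat-completions client.

```python
from openai import OpenAI

# Illustrative sketch of the two junk metrics described above. Field names
# (likes, retweets, replies, text), thresholds, and the classification prompt
# are assumptions for demonstration, not the authors' exact setup.

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def m1_engagement_junk(tweet: dict, min_engagement: int = 500,
                       max_words: int = 30) -> bool:
    """Metric 1: treat highly engaged, very short tweets as junk."""
    engagement = tweet["likes"] + tweet["retweets"] + tweet["replies"]
    return engagement >= min_engagement and len(tweet["text"].split()) <= max_words


M2_PROMPT = (
    "Classify the semantic quality of this tweet as JUNK or OK. JUNK means "
    "sensationalized, conspiratorial, or superficially attention-seeking "
    "content.\n\nTweet: {text}\n\nLabel:"
)


def m2_semantic_junk(tweet: dict) -> bool:
    """Metric 2: ask GPT-4o to judge semantic quality (prompt paraphrased)."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": M2_PROMPT.format(text=tweet["text"])}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("JUNK")


def percent_agreement(model_labels: list, human_labels: list) -> float:
    """Raw percent agreement: one plausible way to compute the 76% figure."""
    hits = sum(m == h for m, h in zip(model_labels, human_labels))
    return 100.0 * hits / len(model_labels)


if __name__ == "__main__":
    corpus = [
        {"text": "You WON'T BELIEVE what they are hiding!!!",
         "likes": 900, "retweets": 400, "replies": 120},
        {"text": "New preprint on continual pre-training with ablations "
                 "across data-quality interventions.",
         "likes": 12, "retweets": 3, "replies": 1},
    ]
    junk = [t for t in corpus if m1_engagement_junk(t) or m2_semantic_junk(t)]
    print(f"{len(junk)}/{len(corpus)} tweets flagged as junk")
```

Setting temperature=0 keeps the labels roughly deterministic, which matters when a sample of them is audited against human raters; the percent_agreement helper shows one plausible way the reported 76% match could be computed.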
The work is significant because it quantifies how dataset quality, not just quantity, matters for continual pre-training and model robustness, highlighting the risks of indiscriminately scraping social media. Technically, it demonstrates a pragmatic pipeline for operationalizing "junk" using engagement signals and LLM-based semantic filters, but it also underscores the subjectivity of those labels and the need for rigorous, task-level measures of degradation. Implications include stronger data curation, provenance tracking, and filtering in retraining pipelines, along with further experiments across architectures, tasks, and longer-horizon continual learning to measure real-world impact.
        