🤖 AI Summary
Google DeepMind researchers unveiled a technique called Generative Data Refinement (GDR) that uses pretrained generative models to “clean” training data that labs normally discard because it is toxic, inaccurate, or contains personally identifiable information (PII). Instead of throwing away an entire document over a single bad token (say, a Social Security number, or an outdated line like “the incoming CEO is…”), GDR rewrites or removes the offending pieces while preserving the rest. In a proof of concept on more than one million lines of code with human expert labels, the method reportedly outperformed existing industry filters and produced better training sets than synthetic-data approaches, which can degrade model quality or even cause “model collapse.”
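To make the idea concrete, here is a minimal sketch of what a GDR-style refinement pass might look like, assuming access to any instruction-following LLM exposed as a simple `generate(prompt) -> str` callable. The prompt wording and function names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Generative Data Refinement idea: instead of dropping
# a whole document that contains PII or stale facts, ask a pretrained
# generative model to rewrite it with only the offending spans removed or
# replaced. The prompt and helper names below are hypothetical.

REFINE_PROMPT = """Rewrite the document below so that it:
- replaces any personally identifiable information (names, SSNs,
  emails, phone numbers) with generic placeholders,
- drops or rephrases sentences that assert time-sensitive facts
  as if they were current,
- otherwise preserves the original content and style verbatim.

Document:
{document}

Rewritten document:"""


def refine_document(document: str, generate) -> str:
    """Refine one document with a pretrained generative model.

    `generate` is any callable mapping a prompt string to a completion
    string (e.g. a thin wrapper around an LLM API of your choice).
    """
    return generate(REFINE_PROMPT.format(document=document))


def refine_corpus(documents, generate):
    """Reclaim documents that a naive filter would discard wholesale."""
    for doc in documents:
        yield refine_document(doc, generate)
```

The design point this illustrates is the one the paper argues for: refinement operates at the span level inside each document, so a single bad token no longer costs the entire document's worth of otherwise usable training tokens.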
This matters because the supply of high-quality, human-generated text suitable for training is shrinking: one study predicts indexed web text could be exhausted between 2026 and 2032. GDR promises a way to reclaim otherwise unusable tokens and scale frontier models without relying solely on synthetic data. Caveats: the paper is newly published and not yet peer reviewed; it is unclear whether Google applies GDR to commercial models like Gemini; and broader legal and privacy implications (including effectiveness on copyrighted text or PII inferable across documents) need more testing. The authors also suggest the idea could extend beyond text and code to other modalities, though video and images currently offer more raw data volume.