🤖 AI Summary
Goldman Sachs’ chief data officer Neema Raphael warned that AI is already running short of fresh training data, a shift that is reshaping how models are built. With much of the public web already consumed, Raphael suggested newer systems may increasingly be trained on the outputs of earlier models, a potential feedback loop that some speculate is already at work in China’s DeepSeek. That could lower the marginal cost of development, but it could also propagate errors and homogenize model behavior. This “peak data” observation echoes concerns from other leaders, including OpenAI’s Ilya Sutskever, that the era of rapid gains from ever-larger web corpora is waning.
To fill the gap, teams are turning to synthetic data (machine-generated text, images, and code) and untapped proprietary enterprise datasets. Synthetic data offers unlimited scale but risks introducing “AI slop”: low-quality or self-referential content that degrades downstream models. Raphael argues the more promising frontier is firm-held data such as trading flows and client interactions, which can unlock higher-value, domain-specific capabilities if it is properly understood, normalized, and contextualized. The technical implications are clear: better data engineering, careful curation and labeling, and safeguards against training on model-generated output will become central to model quality, and the community should watch whether reliance on synthetic sources leads to a creative plateau or to an evolving new data economy.
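To make the curation point concrete, below is a minimal, hypothetical sketch (not from the article or any Goldman Sachs system) of the kind of pre-training filter this argument implies: exact-deduplicating documents and screening out obvious machine-generated boilerplate before text enters a corpus. The `looks_like_ai_slop` heuristic and its phrase list are illustrative assumptions, not a vetted detector.

```python
import hashlib
from typing import Iterable, Iterator

# Phrases that often signal model-generated or self-referential text.
# Illustrative assumption only; real pipelines use trained classifiers.
SLOP_MARKERS = (
    "as an ai language model",
    "i cannot provide",
    "regenerate response",
)


def looks_like_ai_slop(text: str) -> bool:
    """Cheap heuristic: flag text containing common model boilerplate."""
    lowered = text.lower()
    return any(marker in lowered for marker in SLOP_MARKERS)


def dedup_and_filter(docs: Iterable[str]) -> Iterator[str]:
    """Yield documents that are not exact duplicates and pass the slop check."""
    seen: set[str] = set()
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate: likely recycled or scraped twice
        seen.add(digest)
        if looks_like_ai_slop(doc):
            continue  # drop suspected model-generated boilerplate
        yield doc


if __name__ == "__main__":
    corpus = [
        "Q3 trading flows showed a rotation into short-dated treasuries.",
        "As an AI language model, I cannot provide financial advice.",
        "Q3 trading flows showed a rotation into short-dated treasuries.",
    ]
    for clean_doc in dedup_and_filter(corpus):
        print(clean_doc)
```

A production pipeline would replace the exact-hash step with near-duplicate detection (for example, MinHash) and the phrase list with provenance tracking and learned quality filters, but the overall shape, deduplication plus filtering against model-generated content, is the safeguard the summary refers to.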