🤖 AI Summary
Goldman Sachs’ chief data officer Neema Raphael warned on the firm's podcast that AI is already running short of fresh training data — a development reshaping how new models are built. With the open web largely tapped out, teams are leaning on synthetic data (machine-generated text, images, code) to keep scaling, but that creates risks: models trained on other models’ outputs can amplify errors and produce low-quality “AI slop.” Raphael pointed to reports about China’s DeepSeek as an example of systems possibly trained more on existing model outputs than novel human-generated data, highlighting a potential “data cascade” where each generation inherits and magnifies prior models’ biases and noise.
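To make the "data cascade" concrete, here is a minimal toy simulation (not from the podcast; the Gaussian setup and sample sizes are assumptions) in which each model generation is fit only to samples drawn from the previous generation's output, so estimation noise compounds instead of averaging out:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of the "data cascade" idea: each generation of "model"
# is fit only to synthetic samples produced by the previous generation.
# With a finite sample at each step, estimation noise compounds and the
# learned distribution drifts away from the original human-generated signal.
# (Hypothetical simulation, not Goldman Sachs' or Raphael's analysis.)

N_GENERATIONS = 10
SAMPLES_PER_GEN = 500

# Generation 0: "human" data, a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=SAMPLES_PER_GEN)

for gen in range(1, N_GENERATIONS + 1):
    # "Train" a model on the current data: here the model is just a
    # Gaussian with the sample mean and standard deviation.
    mu, sigma = data.mean(), data.std()
    print(f"gen {gen:2d}: mean={mu:+.3f}  std={sigma:.3f}")
    # The next generation trains only on this model's synthetic output.
    data = rng.normal(loc=mu, scale=sigma, size=SAMPLES_PER_GEN)
```

Over repeated runs the fitted mean drifts and the spread tends to narrow, a rough analogue of each generation inheriting and magnifying the previous models' biases and noise.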
The practical implication is a shift from public scraping to proprietary, enterprise datasets as the next growth vector. Corporations hold rich, domain-specific signals — trading flows, client interactions, product telemetry — that could materially improve model utility if properly understood, normalized, and integrated. Raphael emphasized that the hard work is not just finding data but contextualizing and cleaning it for business consumption. The broader concern is both philosophical and technical: if training increasingly relies on synthetic content, the field risks a creative plateau and erosion of human signal, making dataset curation and access to high-quality proprietary data strategic priorities for the AI/ML community.
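As a rough illustration of what "contextualizing and cleaning" proprietary data can involve, the sketch below is hypothetical (the column names, schema, and pandas-based pipeline are assumptions, not anything described on the podcast): it deduplicates client-interaction records, normalizes timestamps, joins in product context, and drops direct identifiers before the data would be handed to a training or retrieval pipeline.

```python
import pandas as pd

def prepare_interactions(raw: pd.DataFrame, products: pd.DataFrame) -> pd.DataFrame:
    """Clean and contextualize raw interaction records (hypothetical schema)."""
    # Remove exact duplicate records exported from the source system.
    df = raw.drop_duplicates(subset=["interaction_id"])
    # Normalize timestamps to UTC so records from different systems align.
    df = df.assign(ts=pd.to_datetime(df["ts"], utc=True, errors="coerce"))
    df = df.dropna(subset=["ts", "note_text"])
    # Attach domain context (product metadata) so the free text is
    # interpretable outside the originating system.
    df = df.merge(
        products[["product_id", "product_name", "asset_class"]],
        on="product_id", how="left",
    )
    # Keep only fields intended for model consumption; drop direct identifiers.
    return df[["interaction_id", "ts", "product_name", "asset_class", "note_text"]]

# Example usage with tiny in-memory frames:
raw = pd.DataFrame({
    "interaction_id": [1, 1, 2],
    "ts": ["2024-05-01T09:30:00", "2024-05-01T09:30:00", "2024-05-02T14:00:00"],
    "product_id": [10, 10, 11],
    "client_email": ["a@x.com", "a@x.com", "b@y.com"],
    "note_text": ["Asked about hedging FX exposure",
                  "Asked about hedging FX exposure",
                  "Interested in a rates product"],
})
products = pd.DataFrame({
    "product_id": [10, 11],
    "product_name": ["FX Forward", "IR Swap"],
    "asset_class": ["FX", "Rates"],
})
print(prepare_interactions(raw, products))
```

The interesting work in practice is deciding which joins and filters preserve business meaning, which is exactly the contextualization effort Raphael describes.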