🤖 AI Summary
AI progress hasn’t stalled for lack of raw bits; it’s stalled because AI labs can’t access the best data. This essay from The Launch Sequence argues that today’s leading models train on hundreds of terabytes of data, while the world has digitized roughly 180–200 zettabytes, over a million times more. The bottleneck is access: privacy rules, proprietary incentives, and the economics of copying lead organizations to hoard high-quality, structured datasets (health records, financial logs, industrial sensor data) that would be far more valuable than additional web-scraped text. The authors propose Attribution-Based Control (ABC) as a way to turn that data into a shared, sustainable resource rather than a one-time giveaway.
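For a sense of the gap, here is a quick back-of-the-envelope check in Python; the 500 TB figure is an assumed stand-in for "hundreds of terabytes", not a number from the essay.

```python
# Illustrative scale check using the summary's rough figures, not measurements.
TB = 10**12           # bytes in a terabyte
ZB = 10**21           # bytes in a zettabyte

training_corpus = 500 * TB    # assumed midpoint of "hundreds of terabytes"
digitized_world = 180 * ZB    # low end of the 180-200 ZB estimate

ratio = digitized_world / training_corpus
print(f"Digitized data is roughly {ratio:.1e}x a frontier training corpus")
# -> roughly 3.6e+08x, comfortably "over a million times more"
```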
ABC is a design criterion, not a single tool: AI systems must let data owners control which specific predictions their data supports and let AI users see and choose which data sources contributed to outputs. Technically this relies on model partitioning by data source plus privacy-preserving infrastructure to audit and meter contributions. Economically, ABC converts data into an ongoing revenue stream (think royalties) that aligns incentives to share rather than silo. The paper recommends a government-led development program (modeled on ARPANET) to build standards and infrastructure. While legal, technical, and economic hurdles remain, ABC offers a concrete path to unlock zettabytes of high-quality, grounded data that could fuel the next paradigm shift in AI.
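To make that design criterion concrete, here is a minimal, hypothetical Python sketch: per-source model partitions carry owner-set usage policies, a query is routed only through partitions whose owners permit that use, and a ledger meters contributions so they can be paid like royalties. Every name here (SourcePartition, AttributionLedger, abc_predict) is an illustration invented for this sketch, not an API from the essay.

```python
# Sketch of the ABC idea as described above: partition by data source,
# enforce owner-side controls, and meter contributions for attribution.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class SourcePartition:
    """A model component trained only on one owner's data."""
    source_id: str
    predict: Callable[[str], float]                  # stand-in for a real model
    allowed_uses: set = field(default_factory=set)   # owner decides which prediction types this data may support

@dataclass
class AttributionLedger:
    """Meters which sources contributed to each output (basis for royalty-style payments)."""
    usage_counts: Dict[str, int] = field(default_factory=dict)

    def record(self, source_id: str) -> None:
        self.usage_counts[source_id] = self.usage_counts.get(source_id, 0) + 1

def abc_predict(query: str, use_case: str,
                partitions: List[SourcePartition],
                ledger: AttributionLedger) -> dict:
    """Route a query only through permitted partitions, combine their outputs,
    and return the prediction together with the sources that contributed."""
    contributions = {}
    for p in partitions:
        if use_case in p.allowed_uses:       # owner-side control
            contributions[p.source_id] = p.predict(query)
            ledger.record(p.source_id)       # metering for ongoing payment
    if not contributions:
        return {"prediction": None, "sources": []}
    prediction = sum(contributions.values()) / len(contributions)   # toy aggregation
    return {"prediction": prediction, "sources": sorted(contributions)}  # user-side transparency

# Example: a hospital permits "diagnosis_support" but nothing else.
hospital = SourcePartition("hospital_records", lambda q: 0.8, {"diagnosis_support"})
web_text = SourcePartition("web_crawl", lambda q: 0.4, {"diagnosis_support", "chat"})
ledger = AttributionLedger()
print(abc_predict("patient symptoms ...", "diagnosis_support", [hospital, web_text], ledger))
print(ledger.usage_counts)   # {'hospital_records': 1, 'web_crawl': 1}
```

In a real ABC system the "partitions" would be model components rather than lambdas, and the metering would run on privacy-preserving audit infrastructure rather than an in-memory dictionary; the sketch only shows how owner-side control, user-side transparency, and contribution metering fit together.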