🤖 AI Summary
AI labs are increasingly treating data as the scarce input that will determine future model performance and margins: data-labeling revenue reportedly grew 88x from 2023 to 2024, OpenAI has reportedly committed ~$1B to data this year with plans to scale to $8B by 2030, and high-profile licensing deals (Reddit's with OpenAI and Google at ≈$60M each; News Corp's with OpenAI at ≈$250M) are reshaping who profits from content. At the same time, the open web is closing (the Common Crawl index is down ~24% vs. 2022), and litigation over scraping and copyright (e.g., Anthropic's recent $1.5B settlement) is raising costs and limiting casual access. This combination of soaring demand, legal friction, and reliance on novel, expert-curated datasets makes transparent valuation and new market mechanisms urgent for the AI/ML community.
To address this, Portex Datalab proposes moving datasets toward public-style price discovery using auction formats (English, Vickrey, Dutch) alongside a fundamentals-based valuation, a "Zestimate" for data. Their baseline model combines its factors multiplicatively, roughly V ≈ s · T · M · U · N · F: usable token count (s) times a per-token baseline (T ≈ $0.001), adjusted by a modality multiplier (M: text = 1, text+code ≈ 1.1–1.4, audio 5–10, video 10–20), a use-case scaler (U: 0.1–1), a uniqueness scaler (N: 0.01–10), and a freshness/decay factor (F). Auctions paired with such signal-driven reserve pricing aim to reduce opacity, limit the winner's curse, and give original data owners clearer compensation, shifting the market from bespoke, broker-driven deals toward faster, more efficient pricing that better aligns incentives for producing frontier AI data.
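As a concrete illustration, here is a minimal sketch of that valuation formula feeding a second-price (Vickrey) clearing rule, assuming the factors combine multiplicatively as described; the function names, default values, and example numbers are hypothetical, not Portex's actual implementation:

```python
def dataset_value(
    tokens: float,             # s: usable token count
    per_token: float = 0.001,  # T: baseline $/token (~$0.001 per the summary)
    modality: float = 1.0,     # M: text=1, text+code ~1.1-1.4, audio 5-10, video 10-20
    use_case: float = 1.0,     # U: use-case scaler (0.1-1)
    uniqueness: float = 1.0,   # N: uniqueness scaler (0.01-10)
    freshness: float = 1.0,    # F: freshness/decay factor
) -> float:
    """Estimated dollar value: V = s * T * M * U * N * F (illustrative)."""
    return tokens * per_token * modality * use_case * uniqueness * freshness


def vickrey_clearing_price(bids: list[float], reserve: float) -> float | None:
    """Second-price rule: the winner pays max(second-highest qualifying bid, reserve).

    Returns None if no bid meets the reserve.
    """
    qualifying = sorted((b for b in bids if b >= reserve), reverse=True)
    if not qualifying:
        return None  # auction fails: reserve not met
    runner_up = qualifying[1] if len(qualifying) > 1 else reserve
    return max(runner_up, reserve)


# Hypothetical example: 500M fresh text+code tokens, fairly unique, strong use case.
reserve = dataset_value(tokens=5e8, modality=1.2, use_case=0.8, uniqueness=2.0)
price = vickrey_clearing_price(bids=[1.5e6, 1.2e6, 0.9e6], reserve=reserve)
print(f"reserve:   ${reserve:,.0f}")  # reserve:   $960,000
print(f"clears at: ${price:,.0f}")    # clears at: $1,200,000
```

Under the Vickrey rule the winner pays the second-highest qualifying bid (floored at the reserve) rather than their own bid, which is what blunts the winner's curse: bidding one's true value becomes a dominant strategy.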