GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks [pdf] (cdn.openai.com)

🤖 AI Summary
OpenAI introduced GDPval, a new benchmark that measures AI performance on real-world, economically valuable tasks drawn from 44 occupations across the nine U.S. GDP‑leading sectors. The full set contains 1,320 tasks (≥30 tasks per occupation) and a gold open subset of 220 tasks (5 tasks per occupation), each created and validated by industry experts (average 14 years' experience) and tied to time/cost estimates — tasks average seven hours of expert work and can span up to weeks.

GDPval emphasizes realism and multimodality (CAD, images, audio, slide decks, spreadsheets), uses blinded pairwise human expert comparisons as its primary metric (win‑rate vs. a human baseline), and provides an experimental automated grader and a public grading service at evals.openai.com.

Results show frontier models improving roughly linearly over time and approaching industry expert quality on many deliverables. In blind pairwise tests, Claude Opus 4.1 led on aesthetics while GPT‑5 led on accuracy; the automated grader reached 66% agreement with human graders (within 5 percentage points of human inter‑rater agreement). The study finds that more reasoning steps, richer context, and scaffolding boost model performance, and that model+human workflows can be faster and cheaper than unaided experts. GDPval offers a practical, extensible way to assess the economic impact and deployment readiness of AI, providing a benchmark better aligned with labor‑market realities than conventional academic tests.
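To make the primary metric concrete, here is a minimal sketch of how a win‑rate against a human baseline could be computed from blinded pairwise judgments. The encoding of judgments as `"model"` / `"human"` / `"tie"` labels and the convention of counting ties as half a win are assumptions for illustration, not GDPval's actual scoring code.

```python
from collections import Counter

def win_rate(judgments):
    """Fraction of blinded pairwise comparisons the model wins.

    judgments: list of labels, each "model", "human", or "tie"
    (hypothetical encoding); a tie counts as half a win.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    return (counts["model"] + 0.5 * counts["tie"]) / total

# Example: 6 model wins, 3 human wins, 1 tie -> 0.65
print(win_rate(["model"] * 6 + ["human"] * 3 + ["tie"]))
```

A win‑rate above 0.5 under this convention means experts preferred the model's deliverable more often than the human baseline's in blinded comparison.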