GDPval: Measuring the performance of our models on real-world tasks (openai.com)

🤖 AI Summary
OpenAI introduced GDPval, a new benchmark that measures how well models perform on economically valuable, real-world knowledge-work tasks. Built from tasks drawn from 44 occupations across the nine industries that contribute most to U.S. GDP, GDPval's full set contains 1,320 specialized tasks, 220 of which are released as an open-sourced gold set. Tasks are realistic deliverables (legal briefs, engineering blueprints, slides, spreadsheets, multimedia), each created and reviewed by professionals averaging 14 years of experience through multiple rounds of expert review. Occupations were selected using BLS and O*NET data, keeping industries that contribute more than 5% of GDP and occupations in which at least 60% of tasks are knowledge work (a toy version of this filter is sketched below), so the benchmark targets where AI can most affect productivity.

GDPval uses blind expert graders who compare AI and human deliverables against detailed rubrics, supplemented by an experimental automated grader available at evals.openai.com; one common way to aggregate such pairwise judgments is sketched after the filter below. In blind tests on the gold set, leading models (GPT‑4o, o4‑mini, GPT‑5, OpenAI o3, Claude Opus 4.1, Gemini 2.5 Pro, Grok 4) sometimes matched or beat industry experts; Claude Opus 4.1 led on aesthetics while GPT‑5 excelled on accuracy. Performance reportedly more than doubled from GPT‑4o to GPT‑5, and models can produce outputs roughly 100x faster and at roughly 100x lower inference cost (not counting human oversight).

OpenAI also reports that fine-tuning, larger models, richer context, and more reasoning steps all boost GDPval performance. GDPval is intentionally an early, one-shot test; planned extensions move toward interactive, context-rich workflows, and the public gold subset and grading service are meant to encourage broader research.
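A minimal sketch of the occupation-selection filter described above. The thresholds (industries contributing more than 5% of GDP; occupations with at least 60% knowledge-work tasks) come from the summary, but the data structures, field names, and numbers here are invented for illustration, not taken from OpenAI's methodology:

```python
# Hypothetical illustration of GDPval's occupation-selection criteria.
# Field names and sample figures are made up; only the two thresholds
# are from the summary.
from dataclasses import dataclass

@dataclass
class Occupation:
    name: str
    industry: str
    knowledge_work_share: float  # fraction of O*NET tasks that are knowledge work

def select_occupations(occupations, industry_gdp_share):
    """Keep occupations in industries contributing >5% of GDP whose
    task mix is at least 60% knowledge work."""
    return [
        occ for occ in occupations
        if industry_gdp_share.get(occ.industry, 0.0) > 0.05
        and occ.knowledge_work_share >= 0.60
    ]

# Toy usage with made-up numbers:
gdp_share = {"Finance": 0.074, "Agriculture": 0.009}
pool = [
    Occupation("Financial Analyst", "Finance", 0.85),
    Occupation("Farm Manager", "Agriculture", 0.40),
]
print([o.name for o in select_occupations(pool, gdp_share)])
# -> ['Financial Analyst']
```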
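And a sketch of scoring the blind pairwise comparisons. GDPval's actual rubric and aggregation are not specified in the summary; this shows one common convention for pairwise evals, a win rate with ties counted as half a win. The function name and label scheme are assumptions for illustration:

```python
# Hypothetical win-rate aggregation for blinded pairwise judgments.
# Each judgment records which deliverable the grader preferred:
# 'model', 'human', or 'tie'. Ties count as half a win, a common
# convention; not necessarily GDPval's exact scheme.
from collections import Counter

def win_rate(judgments):
    counts = Counter(judgments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["model"] + 0.5 * counts["tie"]) / total

# Toy example: 3 wins, 2 ties, 5 losses -> win rate 0.4
print(win_rate(["model"] * 3 + ["tie"] * 2 + ["human"] * 5))
```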