🤖 AI Summary
OpenAI has published GDPval, a new evaluation framework that measures how well large language models perform real-world work tasks by comparing model outputs against deliverables produced by industry experts across 44 occupations and nine industry sectors. In the study, conducted by OpenAI's Economic Research team with Harvard economist David Deming for the NBER, Anthropic's Claude Opus 4.1 topped the ranking with a 47.6% "win rate" (the share of tasks on which graders judged the AI's output to beat the human expert's), followed by GPT‑5 (high) at 38.8% and o3 (high) at 34.1%; GPT‑4o scored lowest at 12.4%, trailing Grok 4 and Gemini 2.5 Pro. Tasks were practical workplace examples, such as replying to a dissatisfied customer, optimizing a vendor-fair table layout, and auditing purchase orders for price inconsistencies, and Claude led in eight of nine sectors, including healthcare and government.
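To make the "win rate" metric concrete, here is a minimal sketch of how such a score could be tallied from pairwise grader judgments. The model names, sector labels, and verdicts are hypothetical placeholders, and this is not OpenAI's actual grading pipeline, just an illustration of the arithmetic behind the headline percentages.

```python
# Illustrative sketch only: tallying a GDPval-style "win rate" from
# pairwise grader verdicts. All data below is hypothetical.
from collections import defaultdict

# Each record: (model, sector, verdict), where the verdict records whether
# the grader preferred the model's deliverable or the human expert's.
judgments = [
    ("claude-opus-4.1", "healthcare", "model"),
    ("claude-opus-4.1", "healthcare", "expert"),
    ("claude-opus-4.1", "government", "model"),
    ("gpt-5-high",      "healthcare", "expert"),
    ("gpt-5-high",      "government", "model"),
    ("gpt-5-high",      "government", "expert"),
]

def win_rates(records):
    """Return {model: fraction of tasks where the model beat the expert}."""
    wins, totals = defaultdict(int), defaultdict(int)
    for model, _sector, verdict in records:
        totals[model] += 1
        if verdict == "model":
            wins[model] += 1
    return {m: wins[m] / totals[m] for m in totals}

print(win_rates(judgments))
# -> {'claude-opus-4.1': 0.666..., 'gpt-5-high': 0.333...}
```

In the actual study, a tally like this would run per model over the full task set and also be broken out by sector, which is where results like "led in eight of nine sectors" come from.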
GDPval is named to echo GDP as an economy-level indicator and aims to shift model evaluation from abstract academic benchmarks to evidence-based, productivity-focused measures. Technically, it emphasizes domain-specific, multi-occupation assessments and human-grounded comparisons rather than synthetic tasks, which can change how teams choose and tune models for real work. The transparent release, which shows a competitor outpacing OpenAI's own models, signals a push toward measuring practical utility over headline benchmarks and could influence research priorities, procurement decisions, and future model development toward workplace effectiveness.