OpenAI says GPT-5 stacks up to humans in a wide range of jobs (techcrunch.com)

0 points 1 day ago | visit original

🤖 AI Summary

OpenAI published GDPval, a new benchmark designed to measure how AI performs on economically valuable work, and reported that its GPT-5-high and Anthropic’s Claude Opus 4.1 are already “approaching” industry expert quality. GDPval-v0 samples 44 occupations across nine industries that drive U.S. GDP (healthcare, finance, manufacturing, government, etc.) and asks experienced professionals to judge AI-generated reports against human-produced ones; each model’s score is its win/tie rate averaged across tasks. GPT-5-high scored 40.6% wins/ties and Claude Opus 4.1 scored 49% (OpenAI notes Claude’s score may be partly driven by presentation/graphics), while GPT-4o, released about 15 months earlier, scored 13.7%, a gap that highlights rapid improvement. The result matters because it shifts evaluation from stylized academic benchmarks toward real-world, economically relevant tasks, offering evidence that current LLMs can already augment professional work by handling report-style outputs and freeing humans for higher-value activities. GDPval’s scope is limited: it covers static reports, not full interactive workflows or broader job responsibilities. It is therefore not a prediction of imminent mass job displacement, but a useful signal that capabilities are advancing. For the AI/ML community, GDPval underscores the need for richer, domain-grounded benchmarks and careful interpretation of “human-level” claims as models move closer to impacting industry workflows.
