🤖 AI Summary
The paper introduces the AI Productivity Index (APEX-v1.0), a new benchmark designed to measure whether frontier language models can perform economically valuable knowledge work. APEX contains 200 expert-sourced test cases across four high-value domains: investment banking, management consulting, law, and primary medical care. Expert practitioners (e.g., Goldman Sachs bankers) wrote realistic prompts and detailed grading rubrics, and the authors evaluated 23 state-of-the-art models, using an automated LM judge to score model outputs against those rubrics.
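To make the evaluation pipeline concrete, here is a minimal sketch of rubric-based LM-judge scoring. The judge prompt, the choice of judge model (gpt-4o here), the binary PASS/FAIL grading, and the fraction-of-criteria aggregation are all illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch of rubric-based LM-judge scoring (illustrative only).
# Assumptions: the judge is reached via the OpenAI chat API; APEX's real
# judge prompt, model, and aggregation scheme are not specified here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a model's answer against one rubric criterion.
Criterion: {criterion}
Answer: {answer}
Reply with exactly PASS or FAIL."""


def judge_criterion(answer: str, criterion: str, judge_model: str = "gpt-4o") -> bool:
    """Ask the judge model whether the answer satisfies a single rubric criterion."""
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(criterion=criterion, answer=answer),
        }],
        temperature=0,  # deterministic grading
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")


def score_case(answer: str, rubric: list[str]) -> float:
    """Score one test case as the fraction of rubric criteria the answer passes."""
    passes = sum(judge_criterion(answer, criterion) for criterion in rubric)
    return passes / len(rubric)
```

Under this (assumed) scheme, a model's headline number would simply be the mean of its per-case scores, e.g. `sum(score_case(a, r) for a, r in cases) / len(cases)`.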
Key findings show sizable headroom before models match human experts: GPT‑5 (Thinking = High) leads with a mean score of 64.2%, followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) at 60.4%; Qwen‑3 235B is the top open-source model and ranks seventh overall. The benchmark surfaces domain-specific strengths and weaknesses, offers a practical gauge of deployment readiness in high-stakes settings, and underscores the need for better measurement and further model improvement, especially on tasks requiring nuanced judgment or deep professional expertise. APEX-v1.0's expert-driven design and LM-judge evaluation are a step toward more economically relevant evaluation, though the limited case set and reliance on automated judging mean further iterations will be needed for broader validity and robustness.