🤖 AI Summary
OpenAI and others have shifted evaluation from trivia benchmarks to real-world, expert-designed tasks: industry veterans (avg. 14 years’ experience) created assignments that take humans 4–7 hours, then blinded graders compared AI and human outputs. Humans still edged out AI, but narrowly, and failures were more often about formatting or instruction-following than about hallucinations. Newer models score much higher than older ones, and agents like Anthropic’s Claude Sonnet 4.5 (early access) demonstrably performed complex, economically meaningful work: reading a sophisticated economics paper, parsing a data archive, converting Stata code to Python, and reproducing the results (with spot-checks verified using GPT-5 Pro). This suggests that replication and reproducibility chores which once required many hours of expert labor can now be automated at scale, though careful benchmarking for accuracy and fairness remains necessary.
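As a rough sketch of the kind of conversion work involved (the dataset, variable names, and regression below are hypothetical, not drawn from the actual paper), translating a typical Stata estimation step into Python might look like this:

```python
# Hypothetical Stata original (illustrative only):
#   use replication_data, clear
#   gen log_wage = ln(wage)
#   reg log_wage education experience, robust
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Load the .dta file shipped in the replication archive.
df = pd.read_stata("replication_data.dta")

# gen log_wage = ln(wage)
df["log_wage"] = np.log(df["wage"])

# reg ..., robust  ->  OLS with heteroskedasticity-robust (HC1) standard errors,
# matching Stata's robust variance estimator.
result = smf.ols("log_wage ~ education + experience", data=df).fit(cov_type="HC1")
print(result.summary())
```

An agent reproducing results would then compare the estimated coefficients and standard errors against the published tables.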
Technically, the breakthrough comes from agentic models that plan, use tools, and self-correct; because per-step errors compound across a chain, small improvements in base-model accuracy yield outsized increases in how many chained steps an agent can complete. METR-style measures of the task length models can complete at a ≥50% success rate show consistent exponential gains from GPT-3 to GPT-5. Practical guidance from OpenAI (delegate to AI as a first pass, review and iterate, then do it yourself if needed) could make expert workflows roughly 40% faster and 60% cheaper. The upshot: agents can do real, valuable work today (e.g., reproducing research), but they won’t replace whole jobs overnight. The pressing choices are institutional: use agents to amplify high-value work, or drown in low-value AI-generated content.
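To see why small accuracy gains translate into much longer feasible chains, consider a simple back-of-envelope model (an assumption for illustration, not METR's actual methodology): if each step succeeds independently with probability p, an n-step chain succeeds with probability p^n, so the longest chain completable at a 50% success rate is ln(0.5)/ln(p).

```python
import math

def max_chain_length(step_accuracy: float, threshold: float = 0.5) -> float:
    """Longest chain of independent steps whose end-to-end success
    probability (step_accuracy ** n) still meets the threshold."""
    return math.log(threshold) / math.log(step_accuracy)

for p in (0.90, 0.95, 0.99):
    print(f"per-step accuracy {p:.2f} -> ~{max_chain_length(p):.1f} steps at >=50% end-to-end success")
# 0.90 -> ~6.6 steps, 0.95 -> ~13.5 steps, 0.99 -> ~69.0 steps:
# a few points of per-step accuracy multiplies feasible chain length several times over.
```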