Failing to Understand the Exponential, Again (www.julian.ac)

🤖 AI Summary
Recent analyses argue that AI progress is still on a clear exponential trajectory, and that treating current imperfections as evidence of a plateau is misleading. METR's "Measuring AI Ability to Complete Long Tasks" shows models steadily extending the duration of software-engineering tasks they can complete autonomously (Sonnet 3.7 reaches ~50% success on roughly one-hour tasks), with a reported doubling time of about seven months. METR's live plot now places Grok 4, Opus 4.1, and GPT-5 at or slightly above that trend, with some models reliably handling tasks longer than two hours.

Acknowledging a potential test-set bias toward engineering work, the pattern also holds in OpenAI's GDPval, a large blinded evaluation (44 occupations across 9 industries, 30 tasks per occupation, 1,320 tasks in total, graded by experienced professionals) in which GPT-5 approaches human-level performance and Claude Opus 4.1 comes close to matching industry experts.

Why it matters: cross-study convergence (task-duration scaling in METR plus broad occupational benchmarking in GDPval) suggests that models are improving across domains, not just on narrow benchmarks. The author projects pragmatic milestones: models completing full 8-hour workdays by mid-2026, matching expert performance across many industries by end-2026, and outperforming experts by end-2027. The takeaway for AI/ML practitioners and policymakers is to take exponential capability gains seriously, factor rapid timelines into deployment and safety planning, and prefer empirical trend analysis over intuition anchored to occasional errors or impressions of incremental releases.
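The doubling-time arithmetic behind these projections is easy to check. Below is a minimal Python sketch, assuming the ~7-month doubling time reported by METR and treating the starting task horizon as a free parameter; the function name and the example inputs are illustrative assumptions, not figures taken from the source.

```python
from math import log2

def months_to_reach(h0_hours: float, target_hours: float,
                    doubling_months: float = 7.0) -> float:
    """Months for the autonomous-task horizon to grow from h0_hours
    to target_hours, assuming exponential growth with a fixed
    doubling time (METR's reported cadence is ~7 months)."""
    return doubling_months * log2(target_hours / h0_hours)

# Illustrative inputs (assumptions, not from the source): start from
# a ~2-hour horizon, as METR's live plot suggests for current models.
print(months_to_reach(2.0, 8.0))   # ~14 months to an 8-hour workday
print(months_to_reach(2.0, 40.0))  # ~30 months to a 40-hour workweek
```

Under these assumptions the 8-hour milestone lands a bit later than the article's mid-2026 projection; models running above the fitted trend, as the live plot indicates, would pull it earlier.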