Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult (simonwillison.net)

🤖 AI Summary
Anthropic released Claude Opus 4.5, pitching it as “best in the world for coding, agents, and computer use.” Technically it matches Sonnet 4.5’s long-context capabilities, with a 200,000‑token context window and a 64,000‑token output limit, and a “reliable knowledge cutoff” of March 2025. Pricing was cut sharply to $5/million input tokens and $25/million output tokens (down from $15/$75 for Opus 4), making it more competitive with OpenAI’s GPT‑5.1 and Google’s Gemini 3, both recent challengers. Key new features include an effort parameter (high/medium/low) for trading thoroughness against speed, a “zoom” tool for targeted inspection of screen regions in support of enhanced computer use, and preservation of thinking blocks across turns. In a hands‑on preview the model drove substantive engineering work: an alpha refactor of sqlite‑utils spanning 20 commits, 39 files, roughly 2,022 additions and 1,173 deletions, a concrete productivity gain.

But the release highlights a growing problem: distinguishing real capability leaps among frontier LLMs is getting harder. The author found Opus 4.5 excellent, yet after switching back could not reliably show that it outperformed Sonnet 4.5 on everyday coding tasks, echoing broader benchmark results that show only single‑digit percentage gains. The implication for the AI/ML community is clear: better evaluation methods are needed. Curated, hard‑failing task sets and concrete “this worked on the new model but not the old” examples from the labs would be far more useful than marginal benchmark bumps.
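To make the closing suggestion concrete, a “hard‑failing task set” can be as simple as a list of prompts paired with deterministic pass/fail checks, run against both the old and the new model so that only differing outcomes are reported. The Python sketch below is a minimal illustration of that idea, not anything from the post: the model IDs, the example task, and its check function are placeholder assumptions, and it assumes the Anthropic Python SDK Messages API.

    # Minimal sketch of a curated "hard-failing task set" comparison.
    # Model IDs, the sample task, and its check() are illustrative assumptions.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Each task: a prompt plus a deterministic pass/fail check on the output.
    TASKS = [
        {
            "name": "sql-window-function",
            "prompt": "Write a SQLite query that ranks rows per group using a window function.",
            # Crude, illustrative check only; real tasks need stricter verification.
            "check": lambda out: "OVER (" in out.upper(),
        },
        # ... more curated tasks that the older model is known to fail
    ]

    def run(model_id: str, prompt: str) -> str:
        """Send one prompt to one model and return the concatenated text reply."""
        msg = client.messages.create(
            model=model_id,
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        return "".join(block.text for block in msg.content if block.type == "text")

    def compare(old_model: str, new_model: str) -> None:
        """Print only the tasks where the two models' outcomes differ."""
        for task in TASKS:
            old_ok = task["check"](run(old_model, task["prompt"]))
            new_ok = task["check"](run(new_model, task["prompt"]))
            if new_ok and not old_ok:
                print(f"IMPROVEMENT: {task['name']}")
            elif old_ok and not new_ok:
                print(f"REGRESSION:  {task['name']}")

    if __name__ == "__main__":
        # Placeholder model identifiers; substitute the real ones.
        compare("claude-sonnet-4-5", "claude-opus-4-5")

The value of a harness like this is exactly the article’s point: a short list of “passed on the new model, failed on the old” task names is far more persuasive evidence of progress than a one‑point bump on an aggregate benchmark.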