🤖 AI Summary
Anthropic’s new smallish Claude variant, Haiku 4.5, was benchmarked on interactive fiction (text-adventure) play. An initial run looked unusually strong, but repeated trials showed Haiku 4.5 performs roughly on par with Google’s Gemini 2.5 Flash (regression coefficient −0.01 vs Flash’s 0.00) while being about twice as expensive ($5.0 vs $2.5 per million tokens) and a bit slower, so it’s not a cost-effective choice for this task. Top performers were Claude Sonnet 4.5 (+0.12, $15/Mt) and GPT-5 (+0.10, $10/Mt); several cheaper/open-weight models (gpt-oss variants, Qwen 3 Coder) fared significantly worse.
The author used a regression-based score normalized to Gemini 2.5 Flash and reported per-model uncertainty and sample counts; Gemini 2.5 Flash had the lowest uncertainty (0.06). They note surprising outcomes: Grok 4 and Gemini 2.5 Pro underperform despite higher cost, and GLM 4.6 raised its price without meaningful gains. Methodological critiques include turn-budget bias: a cash budget (capping each model's output by a fixed dollar spend rather than a fixed turn count) would better equalize evaluation cost across models. Game-level variability (e.g., So Far has high variance) also drives noise. Practical implication: Haiku 4.5 isn't recommended for automated play/transcription work given the current cost-performance tradeoff, and future benchmarks should account for token cost and per-game variance.
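To make the scoring scheme concrete, here is a minimal sketch of a baseline-normalized regression with game fixed effects; this is an assumption about the setup, not the author's actual code, and the column names and numbers below are synthetic placeholders.

```python
# Hypothetical sketch: regression score normalized to a reference model.
# Assumed data layout: one row per playthrough with columns model, game, score.
# The values are illustrative placeholders, not results from the post.
import pandas as pd
import statsmodels.formula.api as smf

runs = pd.DataFrame({
    "model": ["gemini-2.5-flash", "gemini-2.5-flash",
              "claude-haiku-4.5", "claude-haiku-4.5",
              "claude-sonnet-4.5", "claude-sonnet-4.5"],
    "game":  ["so_far", "lost_pig"] * 3,
    "score": [0.40, 0.55, 0.39, 0.54, 0.52, 0.67],
})

# Treatment coding pins the reference model's coefficient at 0, so every other
# model's coefficient reads as its offset from Gemini 2.5 Flash (cf. the post's
# +0.12 for Sonnet 4.5 and -0.01 for Haiku 4.5). Game fixed effects absorb
# per-game difficulty and variance, such as So Far's noisiness.
fit = smf.ols(
    "score ~ C(model, Treatment(reference='gemini-2.5-flash')) + C(game)",
    data=runs,
).fit()

print(fit.params)  # per-model offsets relative to the baseline
print(fit.bse)     # standard errors, analogous to the reported uncertainties
```

The baseline's own offset is 0 by construction, which is why Gemini 2.5 Flash anchors the scale and other models are reported as signed differences from it.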