To solve the benchmark crisis, evals must think (blog.fig.inc)

🤖 AI Summary
Frontier benchmarks have collapsed into a mirage: GPT‑4, Claude, and Gemini all score ~95% on HumanEval and top out MMLU/GSM8K, yet real deployments still fail on routine customer tasks. The article argues this gap is driven by dataset contamination and metric gaming: models memorizing test examples (Sun et al., 2025), RL fine‑tuning that exploits benchmark patterns (Goodhart’s Law), and static test sets that models can “study for.” Evidence includes models reproducing pre‑cutoff problems and teams whose models aced algorithmic puzzles but failed to refactor real React code. Static benchmarks therefore stop measuring general capability and create dangerous deployment illusions.

The proposed fix is to reconceive evaluations as adaptive, generative systems: living benchmarks (LiveBench’s monthly, post‑cutoff queries that keep top model accuracy below 70%), procedural generation (MCPEval synthesizing executable API tasks), learned simulators that model task difficulty, and adversarial discovery loops (Anthropic‑style self‑attacks; DART cuts violation risk by ~53%). Crucially, “wild deployment” telemetry (Discord, Prosus) closes the loop so that production failures become new test cases.

The tradeoffs are nontrivial: 10–100× higher evaluation costs, infrastructure and versioning complexity, and standardization tensions. But the payoff is economic, because evaluation coverage becomes the gating factor for safe deployment. The author predicts static benchmarks will be obsolete for frontier work by 2026; the winners will be the teams that build continual, adversarial, production‑aware eval stacks, not just bigger models.
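To make the "living benchmark" and "production failures become new test cases" ideas concrete, here is a minimal sketch, not taken from the article: the names `LivingEvalSuite`, `add_production_failure`, `generate_fresh_cases`, and `model_fn` are all hypothetical, and the procedurally generated tasks are toy arithmetic standing in for synthesized API or coding tasks.

```python
"""Minimal sketch of a 'living' eval loop (hypothetical, not the article's code).

Two ideas from the summary are illustrated:
  1. Procedural generation: fresh tasks are synthesized per run, so nothing
     can be memorized from a static test set.
  2. Wild-deployment feedback: logged production failures are folded back in
     as permanent regression cases.
"""
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    prompt: str
    expected: str
    source: str  # "generated" or "production"


@dataclass
class LivingEvalSuite:
    cases: List[EvalCase] = field(default_factory=list)

    def add_production_failure(self, prompt: str, expected: str) -> None:
        # A failure observed in deployment becomes a permanent regression case.
        self.cases.append(EvalCase(prompt, expected, source="production"))

    def generate_fresh_cases(self, n: int, seed: int = 0) -> None:
        # Procedurally generate post-cutoff tasks (toy arithmetic here);
        # a real system would synthesize executable API or refactoring tasks.
        rng = random.Random(seed)
        for _ in range(n):
            a, b = rng.randint(100, 999), rng.randint(100, 999)
            self.cases.append(
                EvalCase(f"What is {a} + {b}?", str(a + b), source="generated")
            )

    def run(self, model_fn: Callable[[str], str]) -> Dict[str, float]:
        # Score the model and report accuracy split by case origin,
        # so regressions on production-derived cases stand out.
        by_source: Dict[str, List[int]] = {}
        for case in self.cases:
            correct = model_fn(case.prompt).strip() == case.expected
            by_source.setdefault(case.source, []).append(int(correct))
        return {src: sum(v) / len(v) for src, v in by_source.items()}


if __name__ == "__main__":
    suite = LivingEvalSuite()
    suite.generate_fresh_cases(n=5, seed=42)
    suite.add_production_failure(
        prompt="Refactor this React class component to use hooks: ...",
        expected="(reference refactor omitted)",
    )

    def dummy_model(prompt: str) -> str:
        # Stand-in for a real model call; always answers "42".
        return "42"

    print(suite.run(dummy_model))
```

The point of the sketch is the loop structure rather than the scoring: each evaluation run draws a new generated set, and every logged deployment failure permanently raises the bar for the next run.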