AI benchmarks are a bad joke – and LLM makers are the ones laughing (www.theregister.com)

🤖 AI Summary
A multi-institution study led by the Oxford Internet Institute examined 445 LLM benchmarks and found that just 16% use rigorous scientific methods to compare model performance. Roughly half of the benchmarks claim to measure fuzzy constructs like “reasoning” or “harmlessness” without defining them, and 27% rely on convenience sampling rather than principled designs such as random or stratified sampling. The paper highlights concrete failings — for instance, math benchmarks such as AIME use problem sets whose carefully crafted numbers can mask weaknesses on larger inputs — and notes how vendors weaponize these scores in marketing (OpenAI’s GPT‑5 rollout cited high AIME, coding and multimodal scores). The authors propose an eight‑point checklist (define the construct, prevent contamination, apply proper statistical comparisons, etc.) to restore construct validity and reproducibility; coauthors come from EPFL, Stanford, TUM, UC Berkeley, Yale and others. The findings matter because benchmarks underpin SOTA and AGI claims, can be gamed by search‑capable agents or tailored datasets, and even influence commercial deals (the report notes industry use of internal “AGI” metrics, including profit‑based thresholds). The paper, and concurrent moves like ARC Prize Verified, underscore an urgent need for standardized definitions, contamination controls, and stronger statistical rigor so benchmark scores meaningfully reflect real model capabilities.
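The “apply proper statistical comparisons” item is the most mechanical one to act on. The sketch below, which is illustrative rather than taken from the paper, uses a paired bootstrap over hypothetical per-item results to check whether an apparent accuracy gap between two models is distinguishable from sampling noise; the model accuracies, item count, and resample count are all invented for the example.

```python
import random

random.seed(0)

# Hypothetical per-item correctness records (1 = correct, 0 = wrong) for two
# models evaluated on the SAME 200 benchmark items. In a real evaluation these
# would come from an eval harness; here they are synthetic for illustration.
n_items = 200
model_a = [1 if random.random() < 0.78 else 0 for _ in range(n_items)]
model_b = [1 if random.random() < 0.74 else 0 for _ in range(n_items)]

def paired_bootstrap_diff(a, b, n_resamples=10_000):
    """Bootstrap the accuracy difference (a - b) by resampling items,
    keeping each item's pair of outcomes together (paired design)."""
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        idx = [random.randrange(n) for _ in range(n)]
        acc_a = sum(a[i] for i in idx) / n
        acc_b = sum(b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    # 95% percentile confidence interval for the accuracy difference.
    lo = diffs[int(0.025 * n_resamples)]
    hi = diffs[int(0.975 * n_resamples)]
    return lo, hi

point_diff = sum(model_a) / n_items - sum(model_b) / n_items
lo, hi = paired_bootstrap_diff(model_a, model_b)
print(f"observed accuracy gap: {point_diff:+.3f}")
print(f"95% bootstrap CI:      [{lo:+.3f}, {hi:+.3f}]")
# If the interval straddles zero, this benchmark run alone does not support a
# "model A beats model B" claim, whatever the headline leaderboard says.
```

On a 30-question set like a single year of AIME, intervals of this kind are typically wide enough to swallow the score gaps vendors advertise, which is the study’s point about small, conveniently sampled benchmarks.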