What are popular AI coding benchmarks actually measuring? (blog.nilenso.com)

🤖 AI Summary
A developer who built StoryMachine dug into popular AI coding benchmarks and found they measure a much narrower slice of software work than their names imply: mostly tidy, well-scoped tasks judged by unit-test pass rates, not full software engineering.

SWE-bench Verified is 500 Python problems (many from Django) requiring tiny surgical fixes (mean ~11 LOC, median 4), with likely training-data leakage. SWE-bench Pro expands to 1,865 problems across Python/Go/JS/TS with larger multi-file fixes (mean ~107 LOC, median 55), human-rewritten prompts, and dockerized environments, so repo setup isn't tested. Aider Polyglot uses Exercism-style exercises (225 problems, multi-language) and expects one-round edits that pass the tests; LiveCodeBench evaluates competitive-programming skill on LeetCode-style problems with hidden test suites.

The significance: benchmark scores (e.g., a model getting 80% on SWE-bench) overstate real-world utility because they ignore messy but crucial parts of engineering such as spec writing, API design, security, maintainability, dependency and environment setup, and distributional shift; many benchmarks also risk contamination from training data. Technically, these suites reward producing small, test-passing patches within instrumented environments and predefined interfaces, which is useful for tracking progress but not sufficient to claim models can autonomously handle full software development. The author is cautiously optimistic about coding agents: benchmarks show capability on well-defined units of work, but real software engineering remains a harder, more human-centric problem.
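To make concrete what these suites actually score, here is a minimal sketch of the evaluation loop they share: apply the model's patch inside a pre-built repo checkout, run the task's unit tests, and count the task as solved only if everything passes. This is not the real SWE-bench or Aider harness; the Task fields, the git-apply step, and the test_command are illustrative assumptions.

```python
"""Sketch of a pass/fail benchmark harness (illustrative, not the real thing)."""
from dataclasses import dataclass
import subprocess


@dataclass
class Task:
    repo_dir: str             # pre-built checkout (dockerized in SWE-bench Pro)
    patch: str                # unified diff produced by the model under test
    test_command: list        # e.g. ["pytest", "tests/test_fix.py", "-q"]


def apply_patch(task: Task) -> bool:
    """Apply the model's diff via `git apply`; a patch that does not apply
    cleanly is scored as a failure, just like a failing test run."""
    result = subprocess.run(
        ["git", "apply", "-"],
        input=task.patch.encode(),
        cwd=task.repo_dir,
        capture_output=True,
    )
    return result.returncode == 0


def tests_pass(task: Task) -> bool:
    """Run the task's held-out unit tests; exit code 0 means they all pass."""
    result = subprocess.run(
        task.test_command,
        cwd=task.repo_dir,
        capture_output=True,
    )
    return result.returncode == 0


def score(tasks: list) -> float:
    """Fraction of tasks 'resolved': the patch applies AND all tests pass.
    Note what this metric never sees: spec quality, API design, security,
    maintainability, or the environment setup the harness already did."""
    solved = sum(1 for t in tasks if apply_patch(t) and tests_pass(t))
    return solved / len(tasks) if tasks else 0.0
```

Headline figures like SWE-bench's resolved percentage are essentially this number; everything the post flags as missing from real engineering happens outside this loop.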