Why most AI coding benchmarks are misleading (COMPASS paper) (arxiv.org)

🤖 AI Summary
Researchers released COMPASS, a multi-dimensional benchmark built on the argument that current AI code-evaluation suites are misleading because they reward functional correctness alone. COMPASS comprises 50 real-world competitive programming tasks sourced from Codility and leverages a massive human baseline of 393,150 submissions. Instead of treating every solution that passes tests as equal, it measures three axes: correctness, algorithmic efficiency (runtime and complexity), and code quality (using industry-standard static analysis and maintainability metrics). The paper evaluates three state-of-the-art reasoning-enhanced models (Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and OpenAI o4-mini-high) and shows that high correctness scores often mask poor algorithmic choices or low-quality, hard-to-maintain code.

This matters because production-grade coding assistants must produce not just working code but efficient, secure, and maintainable implementations. COMPASS exposes where models take shortcuts, such as brute-force or test-specific hacks that pass unit tests but scale poorly, and quantifies those failures against realistic human performance. Technical implications include incorporating asymptotic complexity and static-quality signals into training and evaluation, developing reward functions that favor algorithmic reasoning and clean code, and using richer benchmarks like COMPASS when judging readiness for real-world deployment.
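To make the three axes concrete, here is a minimal Python sketch of scoring a candidate solution on correctness, efficiency, and code quality. This is not the paper's methodology: the function names, the runtime-growth efficiency proxy, the AST-based quality proxy, the example `has_dup` solution, and the weights in `composite_score` are all illustrative assumptions.

```python
import ast
import time
from typing import Callable, Sequence, Tuple


def correctness_score(solution: Callable, tests: Sequence[Tuple[tuple, object]]) -> float:
    """Fraction of test cases whose output matches the expected value."""
    passed = sum(1 for args, expected in tests if solution(*args) == expected)
    return passed / len(tests)


def efficiency_score(solution: Callable, make_input: Callable[[int], tuple],
                     sizes: Sequence[int] = (500, 1_000, 2_000, 4_000)) -> float:
    """Crude asymptotic proxy: compare runtime growth to input-size growth.
    A score near 1.0 suggests roughly linear scaling; much lower values
    suggest super-linear (e.g. quadratic) behaviour."""
    times = []
    for n in sizes:
        args = make_input(n)
        start = time.perf_counter()
        solution(*args)
        times.append(time.perf_counter() - start)
    runtime_growth = times[-1] / max(times[0], 1e-9)
    size_growth = sizes[-1] / sizes[0]
    return min(1.0, size_growth / max(runtime_growth, 1e-9))


def quality_score(source: str) -> float:
    """Toy maintainability proxy: penalise branching density in the AST.
    A real evaluation would use industry static-analysis tools instead."""
    nodes = list(ast.walk(ast.parse(source)))
    branches = sum(isinstance(n, (ast.If, ast.For, ast.While, ast.Try)) for n in nodes)
    return max(0.0, 1.0 - branches / max(len(nodes), 1))


def composite_score(corr: float, eff: float, qual: float,
                    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted aggregate; these weights are arbitrary placeholders."""
    return weights[0] * corr + weights[1] * eff + weights[2] * qual


if __name__ == "__main__":
    # Example: score a naive O(n^2) duplicate finder. It passes every test,
    # but the efficiency proxy flags its poor scaling.
    src = ("def has_dup(xs):\n"
           "    return any(x == y for i, x in enumerate(xs) for y in xs[i + 1:])\n")
    namespace = {}
    exec(src, namespace)
    has_dup = namespace["has_dup"]

    tests = [(([1, 2, 3],), False), (([1, 2, 1],), True), (([],), False)]
    corr = correctness_score(has_dup, tests)
    eff = efficiency_score(has_dup, lambda n: (list(range(n)),))
    qual = quality_score(src)
    print(f"correctness={corr:.2f} efficiency={eff:.2f} quality={qual:.2f} "
          f"composite={composite_score(corr, eff, qual):.2f}")
```

The point of the sketch is the shape of the evaluation, not the specific proxies: a solution that merely passes its tests can still score poorly on the efficiency and quality axes, which is the gap the summary says single-metric benchmarks hide.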