The SWE-Bench Illusion (www.microsoft.com)

🤖 AI Summary
A new analysis, "The SWE-Bench Illusion," shows that recent state-of-the-art LLMs’ strong scores on SWE-Bench—an influential benchmark for software-engineering tasks—may reflect memorization or data contamination rather than genuine problem-solving. The authors introduce two diagnostic probes: (1) file-path identification from issue descriptions alone, and (2) ground-truth function reproduction given only the current file context and an issue description. On SWE-Bench Verified, models hit up to 76% accuracy at predicting buggy file paths using only issue text, but that drops to at most 53% on repositories not included in SWE-Bench. For function reproduction, verbatim similarity metrics (consecutive 5-gram accuracy) reach up to 35% on SWE-Bench Verified/Full versus only up to 18% on other benchmarks.

These results matter because they suggest published gains may overstate models’ generalizable coding ability: high performance can come from remembering training data or test-set leaks, not from transferable reasoning. Practically, this undermines trust in evaluation, model selection, and deployment decisions for coding assistants. The paper underscores the need for contamination-resistant benchmarks, explicit tests that separate memorization from reasoning, and more rigorous evaluation protocols to reliably measure LLMs’ true software-engineering capabilities.
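To make the verbatim-similarity probe concrete, here is a minimal sketch of one plausible reading of "consecutive 5-gram accuracy": the fraction of token 5-grams in a model's generated function that appear verbatim in the ground-truth function. The function name, whitespace tokenization, and this exact definition are assumptions for illustration; the paper's actual metric may differ.

```python
def ngrams(tokens, n=5):
    """All consecutive n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def consecutive_ngram_accuracy(generated: str, reference: str, n: int = 5) -> float:
    """Fraction of the generated text's n-grams found verbatim in the
    reference. NOTE: one plausible interpretation of the paper's
    "consecutive 5-gram accuracy", not its confirmed definition.
    Uses naive whitespace tokenization for simplicity."""
    gen_grams = ngrams(generated.split(), n)
    ref_grams = set(ngrams(reference.split(), n))
    if not gen_grams:  # generation shorter than n tokens
        return 0.0
    return sum(g in ref_grams for g in gen_grams) / len(gen_grams)
```

Under this reading, a model that reproduces the ground-truth function verbatim scores 1.0, while paraphrased but functionally equivalent code scores near 0 — which is why a high value is evidence of memorization rather than reasoning.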