The SWE-Bench Illusion: When LLMs Remember Instead of Reason (arxiv.org)

🤖 AI Summary
Researchers show that state-of-the-art LLMs' strong scores on SWE-Bench, an influential benchmark for fixing real-world GitHub issues, may reflect memorization and dataset contamination more than transferable reasoning. The authors use two diagnostic probes: (1) predicting the buggy file path from the issue text alone, and (2) reproducing the ground-truth function given only the current file context plus the issue description. The results reveal striking gaps: models reach up to 76% accuracy at identifying file paths without any access to repository structure, but only about 53% on tasks from repositories not present in SWE-Bench. Likewise, verbatim reproduction (measured by consecutive 5-gram overlap) is as high as 35% on SWE-Bench variants versus only about 18% on other coding benchmarks. These results imply that reported advances may overstate real-world problem-solving ability and instead reflect memorized code or leaked examples, raising reproducibility and deployment risks. For the AI/ML community, this underlines the need for contamination-resistant benchmarks and stricter evaluation protocols (e.g., held-out repositories, time-based splits, synthetic or adversarial tasks) to separate genuine generalization from artifact-driven performance. The paper's diagnostic tasks offer practical tools for auditing benchmarks and guiding more robust assessments of LLM coding ability.
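To make the verbatim-reproduction metric concrete, here is a minimal sketch of a consecutive 5-gram overlap score: the fraction of the ground-truth function's consecutive 5-grams that also appear verbatim in the model's output. The whitespace tokenization, normalization, and exact scoring direction are assumptions for illustration, not details taken from the paper.

from typing import List, Tuple

def ngrams(tokens: List[str], n: int = 5) -> List[Tuple[str, ...]]:
    # All consecutive n-grams of a token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def consecutive_ngram_overlap(ground_truth: str, generated: str, n: int = 5) -> float:
    # Fraction of the ground-truth function's consecutive n-grams that
    # also occur verbatim in the generated code. Simple whitespace
    # tokenization is an assumption; the paper may normalize differently.
    gt_ngrams = ngrams(ground_truth.split(), n)
    if not gt_ngrams:
        return 0.0
    gen_ngrams = set(ngrams(generated.split(), n))
    hits = sum(1 for g in gt_ngrams if g in gen_ngrams)
    return hits / len(gt_ngrams)

if __name__ == "__main__":
    gt = "def add(a, b):\n    return a + b"
    gen = "def add(a, b):\n    # sum two values\n    return a + b"
    print(round(consecutive_ngram_overlap(gt, gen, n=3), 2))

Under this kind of measure, a high score indicates the model reproduced long verbatim spans of the reference solution, which is the memorization signal the paper's 35% vs. ~18% comparison points to.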