🤖 AI Summary
Researchers show that state-of-the-art LLMs’ strong scores on SWE-Bench, an influential benchmark for fixing real-world GitHub issues, may reflect memorization and dataset contamination more than transferable reasoning. Using two diagnostic probes: (1) predicting buggy file paths from the issue text alone, and (2) reproducing ground-truth functions given only the current file context plus the issue description, the authors find striking gaps. Models reach up to 76% accuracy at identifying buggy file paths without any access to repository structure, but only about 53% on tasks from repositories not present in SWE-Bench. Likewise, verbatim reproduction (measured by consecutive 5-gram overlap) reaches 35% on SWE-Bench variants versus roughly 18% on other coding benchmarks.
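To make the second probe concrete, here is a minimal sketch of one way a consecutive 5-gram overlap score could be computed between a model's generated function and the ground-truth patch. The tokenization (whitespace splitting), the aggregation (fraction of generated 5-grams found verbatim in the reference), and the helper name `five_gram_overlap` are assumptions for illustration; the paper's exact metric definition may differ.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of consecutive n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def five_gram_overlap(generated: str, reference: str, n: int = 5) -> float:
    """Fraction of the generated code's consecutive n-grams that also
    appear verbatim in the reference (ground-truth) function."""
    gen_tokens = generated.split()
    gen_grams = [tuple(gen_tokens[i:i + n])
                 for i in range(len(gen_tokens) - n + 1)]
    if not gen_grams:
        return 0.0
    ref_grams = ngrams(reference.split(), n)
    matched = sum(1 for g in gen_grams if g in ref_grams)
    return matched / len(gen_grams)
```

Under this reading, an unusually high overlap on benchmark tasks (relative to tasks from unseen repositories) suggests the model is reproducing memorized code rather than synthesizing a fix.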
These results imply that reported advances may overstate real-world problem-solving ability and instead reflect memorized code or leaked examples, raising reproducibility and deployment risks. For the AI/ML community, this underscores the need for contamination-resistant benchmarks and stricter evaluation protocols (e.g., held-out repositories, time-based splits, synthetic or adversarial tasks) to separate genuine generalization from artifact-driven performance; a sketch of such a filtering step follows below. The paper’s diagnostic tasks offer practical tools to audit benchmarks and guide more robust assessments of LLM coding abilities.
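As a rough illustration of the held-out-repository and time-based-split idea, the sketch below filters candidate tasks to those least likely to be contaminated. The task fields `repo` and `created_at`, and the function name `build_eval_split`, are hypothetical; any real protocol would need to match the benchmark's actual schema and the model's documented training cutoff.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Set


def build_eval_split(tasks: Iterable[Dict],
                     known_training_repos: Set[str],
                     cutoff: datetime) -> List[Dict]:
    """Keep only tasks from repositories absent from the model's known
    training corpus and whose issues were created after the training cutoff."""
    return [
        t for t in tasks
        if t["repo"] not in known_training_repos
        and datetime.fromisoformat(t["created_at"]) > cutoff
    ]
```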