AI agents still can't solve 1/3 of SWE-Bench problems. Why not? (A Case Study) (surgehq.ai)

🤖 AI Summary
SWE-bench’s “bash-only” benchmark, where models must debug real GitHub issues using only shell commands, exposes a striking failure mode: spiraling hallucinations. In a case study comparing Gemini 2.5 Pro, Claude Sonnet 4, and GPT‑5 on a simple astropy bug, Gemini hallucinated missing code and terminal outputs after a truncated file read, invented a nonexistent BaseWriter class and methods, and confidently patched the repo based on those fabrications. Claude made similar early mistakes but detected the resulting runtime errors, backtracked, and recovered. GPT‑5 avoided guessing: when context was missing, it explicitly re-checked the filesystem and fixed the issue on the first attempt. Top models still solve only ~67% of SWE-bench tasks, so roughly one in three real issues go unsolved.

Technically, the bug required a two-line fix (pass cols into self.data and call _set_col_formats()), but success depended on correctly grounding file reads and on distinguishing Seen (actual file content), Remembered (training priors), and Guessed (unverified assumptions). The case shows how small uncertainties compound into catastrophic failures when agents don’t detect truncation or verify terminal outputs.

Implications: agentic coding needs robust grounding, uncertainty-aware prompting, explicit re-checks, sandboxed verification, and better tooling to flag missing context or hallucinated state. These are practical steps toward making models reliable for real-world software engineering.
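To make the "detect truncation" point concrete, here is a minimal, hypothetical sketch of a file-read tool that flags missing context in-band instead of silently returning a partial file. It is not taken from the case study or from any real agent framework; the ReadResult type, read_file helper, and max_lines parameter are assumed names chosen for illustration.

    # Hypothetical sketch: a file-read tool for a coding agent that marks truncation
    # explicitly, so the model cannot mistake a partial read for the whole file.
    from dataclasses import dataclass

    @dataclass
    class ReadResult:
        text: str          # what the agent actually Saw
        truncated: bool    # whether part of the file was NOT seen
        total_lines: int   # size of the real file on disk

    def read_file(path: str, max_lines: int = 500) -> ReadResult:
        """Read a file, surfacing any truncation instead of hiding it."""
        with open(path, encoding="utf-8") as f:
            lines = f.readlines()
        truncated = len(lines) > max_lines
        text = "".join(lines[:max_lines])
        if truncated:
            # Put the gap in-band: the agent is told exactly how much it has
            # not seen, and is prompted to re-read rather than guess the rest.
            text += (f"\n[... truncated: {len(lines) - max_lines} more lines; "
                     "re-read with a larger window before editing ...]\n")
        return ReadResult(text=text, truncated=truncated, total_lines=len(lines))

In this design the truncation marker lives in the returned text itself, which is the kind of tooling-level guardrail the summary's implications point toward: the model never has to infer from silence that context is missing.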