🤖 AI Summary
SWE-bench researchers ran three frontier coding agents (Gemini 2.5 Pro, Claude Sonnet 4, GPT‑5) on a “bash-only” real‑GitHub bug: Table.write(..., formats=...) ignored the supplied column formats when writing HTML. The benchmark forces agents to debug through the shell alone, with no internet access or extra tools, and even state‑of‑the‑art models top out near 67% success. In one run, after a truncated file read, Gemini confidently hallucinated the missing context, inventing a BaseWriter class, fake methods (e.g., _get_col_str_iters and data.get_str_vals with wrong signatures), and even imagined terminal outputs and shifting line numbers; the self‑reinforcing spiral produced hundreds of lines of bogus edits and ultimately failed. Claude followed a similar path but caught the resulting runtime errors, re‑investigated, and recovered, while GPT‑5 explicitly rechecked the missing context instead of guessing and fixed the bug on the first try.
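To make the bug concrete, here is a minimal reproduction sketch, assuming the task is the astropy HTML-writer issue described above; the table contents and column names are illustrative, not taken from the benchmark instance.

```python
from io import StringIO

from astropy.table import Table

t = Table([[1.23456789, 2.3456789], [3, 4]], names=("wave", "flux"))

# 'formats' is honored by the ASCII/CSV writers...
csv_buf = StringIO()
t.write(csv_buf, format="csv", formats={"wave": "%.2f"})
print(csv_buf.getvalue())   # wave column rendered as 1.23 / 2.35

# ...but in affected versions it was silently ignored when writing HTML,
# so the table cells kept full precision instead of the requested format.
html_buf = StringIO()
t.write(html_buf, format="html", formats={"wave": "%.2f"})
print(html_buf.getvalue())
```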
The actual fix was trivial (pass cols to the data object and call _set_col_formats()), but the failure illustrates a core risk for agentic coding: truncated I/O or partial reads can trigger confident prior‑knowledge recall that compounds into catastrophic hallucination. The implications for the AI/ML community are clear: agents must (1) distinguish Seen vs. Remembered vs. Guessed, (2) detect and flag missing context, (3) verify terminal outputs before editing, and (4) adopt conservative, test‑driven recovery strategies (see the sketch below). SWE‑bench’s stress test exposes how small assumption errors cascade in multi‑turn agents and underscores the practical engineering needed before such systems can be reliably trusted in real codebases.
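A hedged illustration of point (4): the kind of small regression test a conservative, test‑driven agent could write before editing, asserting that the supplied format actually appears in the HTML output. The test name and assertions are hypothetical, not drawn from the astropy test suite.

```python
from io import StringIO

from astropy.table import Table


def test_html_write_respects_formats():
    # Write a single float column with an explicit format string.
    t = Table([[1.23456789]], names=("wave",))
    buf = StringIO()
    t.write(buf, format="html", formats={"wave": "%.2f"})
    html = buf.getvalue()

    # The formatted value should appear, and the raw full-precision
    # value should not survive into the HTML output.
    assert "1.23" in html
    assert "1.23456789" not in html
```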