Show HN: New eval from SWE-bench team evaluates LMs based on goals not tickets (codeclash.ai)

🤖 AI Summary
SWE-bench introduced CodeClash, a new open-source benchmark that evaluates LMs on goal-oriented software engineering instead of isolated ticket-fixing. Rather than giving models specific GitHub issues, CodeClash hands them high-level objectives (e.g., increase revenue, survive longer) and lets each model iteratively build and evolve its own codebase over multiple rounds. Every round has an edit phase, where models analyze past logs, refactor, run tests, and implement features, and a compete phase, where codebases face off in simulated arenas that score outcomes like income, territory control, or survival.

In a large-scale run (8 models × 6 arenas), the team ran 1,680 tournaments with 15 rounds each (25,200 rounds total), producing ~50k agent trajectories; Claude Sonnet 4.5 tops the leaderboard, followed by GPT‑5 and o3.

Technically, CodeClash stresses long-horizon planning, automated testing, log analysis, strategy adaptation, and lifecycle code maintenance, capabilities that standard issue-based benchmarks don’t measure. Early results show models often fail to meaningfully improve across rounds, accumulate technical debt rapidly, and exhibit diverse failure modes; in some arenas (e.g., RobotRumble), human solutions still outpace the best LMs. The benchmark highlights gaps in models’ multi-step decision-making, persistent-state reasoning, and engineering pragmatics, and provides a framework for developing and evaluating systems that must set and pursue outcomes over time.
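The round structure described above (an edit phase followed by a compete phase, repeated across a multi-round tournament) can be pictured with a minimal sketch. Everything here is an assumption for illustration: `Competitor`, `agent.propose_edits`, `arena.run_match`, and the result fields are hypothetical stand-ins, not CodeClash's actual interfaces.

```python
# Hypothetical sketch of a CodeClash-style tournament loop.
# All class names and method signatures are illustrative assumptions,
# not the benchmark's real API.

from dataclasses import dataclass, field


@dataclass
class Competitor:
    model_name: str
    codebase: dict = field(default_factory=dict)   # path -> file contents
    logs: list = field(default_factory=list)       # arena logs from past rounds
    score: float = 0.0


def edit_phase(competitor: Competitor, agent) -> None:
    """Let the model inspect its past logs and revise its own codebase."""
    patch = agent.propose_edits(                   # hypothetical agent interface
        model=competitor.model_name,
        codebase=competitor.codebase,
        past_logs=competitor.logs,
    )
    competitor.codebase.update(patch)              # apply edits before competing


def compete_phase(competitors: list, arena) -> None:
    """Run all codebases head-to-head and record outcome-based scores."""
    results = arena.run_match([c.codebase for c in competitors])  # hypothetical arena interface
    for competitor, result in zip(competitors, results):
        competitor.score += result.objective_value  # e.g. income, territory, survival time
        competitor.logs.append(result.log)          # feeds the next edit phase


def run_tournament(competitors: list, arena, agent, num_rounds: int = 15) -> Competitor:
    """Alternate edit and compete phases, then return the cumulative winner."""
    for _ in range(num_rounds):
        for competitor in competitors:
            edit_phase(competitor, agent)
        compete_phase(competitors, arena)
    return max(competitors, key=lambda c: c.score)
```

The point of the sketch is the feedback loop: each competitor's arena logs from one round become the input its agent reasons over in the next edit phase, which is what makes the benchmark test long-horizon maintenance rather than one-shot patching.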