Show HN: New Benchmark from SWE-bench team is 0% solved (programbench.com)

🤖 AI Summary
The SWE-bench team has announced a new benchmark, ProgramBench, which challenges language models to recreate programs using only a compiled binary and its documentation, with no access to source code, decompilation, or the internet. Despite the capabilities of advanced models such as GPT-5 and Claude, none of the tested models fully solved any of the 200 tasks, which span a range of software complexity. The benchmark highlights how hard it is to generate a program from scratch: models must make design decisions based solely on observed behavior, which is checked by more than 248,000 behavioral tests.

The result matters for the AI/ML community because it exposes the current limits of single-agent systems on software engineering tasks. Unlike previous benchmarks that allowed harness tuning or leveraged existing code repositories, ProgramBench takes a cleanroom implementation approach, with strict restrictions intended to prevent cheating. The results point to a need for more capable models and may encourage exploration of multi-agent systems. Overall, ProgramBench is a rigorous evaluation of AI performance on realistic coding tasks.
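To make the gap between "passes some behavioral tests" and "fully solved" concrete, here is a minimal sketch of differential behavioral testing. It is not ProgramBench's actual harness. The function names (`observe`, `pass_rate`, `fully_solved`), the stand-in programs, and the rule that a task counts as solved only when every test matches are all assumptions made for illustration. Small Python one-liners stand in for the reference binary and the model-written candidates.

```python
import subprocess
import sys

def observe(cmd, stdin_text):
    """Run a program and return its observable behavior: (exit code, stdout)."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout

def pass_rate(reference_cmd, candidate_cmd, test_inputs):
    """Fraction of inputs on which the candidate behaves like the reference."""
    passed = sum(
        observe(reference_cmd, text) == observe(candidate_cmd, text)
        for text in test_inputs
    )
    return passed / len(test_inputs)

def fully_solved(reference_cmd, candidate_cmd, test_inputs):
    """Assumed scoring rule: solved only if every behavioral test matches."""
    return pass_rate(reference_cmd, candidate_cmd, test_inputs) == 1.0

# Hypothetical stand-ins: the "reference binary" upper-cases its stdin.
REFERENCE = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
# A candidate with a different implementation but identical behavior.
GOOD = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())"]
# A candidate that gets the behavior wrong on alphabetic input.
BAD = [sys.executable, "-c",
       "import sys; sys.stdout.write(sys.stdin.read().lower())"]

TESTS = ["hello\n", "ABC\n", "123\n"]

print("good:", pass_rate(REFERENCE, GOOD, TESTS), fully_solved(REFERENCE, GOOD, TESTS))
print("bad: ", pass_rate(REFERENCE, BAD, TESTS), fully_solved(REFERENCE, BAD, TESTS))
```

Under this assumed rule, a candidate can match the reference on some inputs and still count as unsolved, which is one way a 0% fully-solved figure can coexist with partial test passes.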