MirrorCode: What's the largest software project AI can complete on its own? (epoch.ai)

🤖 AI Summary
MirrorCode is a new benchmark, co-developed with METR, for assessing AI capabilities on long-horizon software engineering tasks. Unlike traditional benchmarks that focus on shorter tasks, MirrorCode asks AI models to reimplement entire programs from scratch without access to the original source code; rigorous end-to-end tests check that the reimplementation's outputs match the original's exactly. The benchmark includes 25 diverse target programs spanning Unix utilities, bioinformatics, and cryptography, and gives models a large inference budget so they can work uninterrupted, even for days. Models such as Claude Opus 4.7 have shown they can tackle these tasks, in one case reimplementing a bioinformatics toolkit in just 14 hours, but they achieved an average success rate of only 56%, which leaves substantial room for improvement.

MirrorCode matters because it pushes evaluation of AI programming skill beyond simpler tasks and onto a more rigorous footing. The tasks are designed to be cheat-resistant: models must solve them without internet access or the original code, which is meant to give a more accurate measure of AI capabilities in a realistic setting. Results show that newer models score significantly higher than their predecessors, but challenges remain, particularly edge cases that are still unsolved. The research team has also released most of the MirrorCode target programs as open source, to support further work on AI-driven code generation.
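The exact-output check described above can be illustrated with a minimal sketch. This is not MirrorCode's actual harness: the names `run`, `compare`, and `pass_rate` are hypothetical, and it assumes each test case is just a list of command-line arguments plus some standard input. It runs a reference program and a candidate reimplementation on the same cases and counts a case as passed only if the exit code and the raw stdout bytes are identical.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CaseResult:
    args: list          # command-line arguments used for this case
    passed: bool        # True only if exit code and stdout match exactly


def run(cmd, stdin_text):
    """Run a command and return (exit code, raw stdout bytes)."""
    proc = subprocess.run(
        cmd, input=stdin_text.encode(), capture_output=True, timeout=30
    )
    return proc.returncode, proc.stdout


def compare(reference_cmd, candidate_cmd, cases):
    """Run both programs on every (args, stdin) case and compare outputs.

    Hypothetical helper: a real harness would likely also sandbox the
    candidate and may compare stderr or output files as well.
    """
    results = []
    for args, stdin_text in cases:
        expected = run(reference_cmd + args, stdin_text)
        actual = run(candidate_cmd + args, stdin_text)
        results.append(CaseResult(args, expected == actual))
    return results


def pass_rate(results):
    """Fraction of cases that matched exactly (0.0 for an empty list)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0
```

For example, `compare(["./original"], ["./reimplementation"], cases)` would compare two executables at those hypothetical paths. The fraction returned by `pass_rate` is only for illustration; the summary does not say how MirrorCode aggregates its scores.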