ProgramBench: Can Language Models Rebuild Programs from Scratch? (arxiv.org)

🤖 AI Summary
A recent study introduced ProgramBench, a benchmarking framework that evaluates whether language models (LMs) can build software projects from scratch using only program documentation. Unlike existing benchmarks that target narrow tasks such as bug fixing or feature development, ProgramBench requires LMs to architect and implement entire codebases whose behavior matches that of a reference executable. Using agent-driven fuzzing, the framework generates end-to-end behavioral tests for 200 diverse tasks, ranging from simple command-line tools to complex applications such as FFmpeg and SQLite. The benchmark matters for the AI/ML community because it pushes evaluation toward the demands of real-world software engineering. Of the nine LMs evaluated, none fully succeeded on any task; the most capable model passed 95% of tests on only 3% of the tasks. The results also show that current models tend to produce simpler, monolithic implementations that diverge from established coding practices, exposing clear gaps in their ability to handle sophisticated software design and motivating further work on AI-driven software engineering methods.
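
For intuition, below is a minimal sketch of the kind of differential behavioral check such a setup implies: feed fuzzed inputs to both a reference executable and a candidate rebuild and compare their outputs. The file names (./reference, ./candidate), the random input generator, and the trial count are hypothetical illustrations, not ProgramBench's actual harness.

    # Minimal sketch of differential behavioral testing (assumed setup, not
    # ProgramBench's harness): two local binaries that read stdin and write stdout.
    import random
    import string
    import subprocess

    def random_input(rng: random.Random, max_len: int = 64) -> str:
        """Generate a fuzzed text input (stand-in for agent-driven fuzzing)."""
        length = rng.randint(0, max_len)
        return "".join(rng.choice(string.printable) for _ in range(length))

    def run(binary: str, data: str) -> tuple[int, str]:
        """Run a binary on stdin data; return (exit code, stdout)."""
        proc = subprocess.run(
            [binary], input=data, capture_output=True, text=True, timeout=10
        )
        return proc.returncode, proc.stdout

    def behavioral_pass_rate(reference: str, candidate: str, trials: int = 100) -> float:
        """Fraction of fuzzed inputs on which the candidate matches the reference."""
        rng = random.Random(0)
        passed = 0
        for _ in range(trials):
            data = random_input(rng)
            if run(reference, data) == run(candidate, data):
                passed += 1
        return passed / trials

    if __name__ == "__main__":
        rate = behavioral_pass_rate("./reference", "./candidate")
        print(f"candidate matches reference on {rate:.0%} of fuzzed inputs")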