🤖 AI Summary
Researchers introduce AutoCodeGen, an automated pipeline in which LLMs generate test inputs, run them in a secure execution sandbox, and reverse-engineer programming problems from the resulting input-output behavior. Using this workflow and a high-performance MultiLanguageSandbox (with compilation and execution support for more than 30 languages), they built AutoCodeBench: a large-scale multilingual code generation benchmark of 3,920 high-difficulty problems evenly distributed across 20 languages. From evaluations of more than 30 open- and closed-source models they distilled AutoCodeBench-Lite (1,586 problems solved by at least two models) for efficient model comparison, and AutoCodeBench-Complete (1,000 problems with 3-shot prompts) to probe the completion abilities of base models.
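To make the workflow concrete, here is a minimal Python sketch of an AutoCodeGen-style loop: an LLM proposes test inputs for a reference solution, a sandbox executes the solution to obtain ground-truth outputs, and the LLM then writes a problem statement consistent with the observed input-output pairs. The `llm.complete` and `sandbox.run` interfaces and all helper names are hypothetical placeholders for illustration, not the paper's actual API.

```python
# Hypothetical sketch of an AutoCodeGen-style pipeline (not the paper's real API).
from dataclasses import dataclass


@dataclass
class Problem:
    description: str         # natural-language problem statement
    reference_solution: str  # code whose behavior defines the ground truth
    test_cases: list         # (input, expected_output) pairs


def generate_test_inputs(llm, solution_code: str, n: int = 20) -> list[str]:
    """Ask the LLM to propose diverse test inputs for a given piece of code."""
    prompt = f"Propose {n} diverse test inputs, one per line, for this code:\n{solution_code}"
    return llm.complete(prompt).splitlines()


def run_in_sandbox(sandbox, code: str, test_input: str) -> str:
    """Execute the code on one input inside an isolated sandbox and return stdout."""
    result = sandbox.run(code=code, stdin=test_input, language="python")
    return result.stdout


def build_problem(llm, sandbox, solution_code: str) -> Problem:
    """Reverse-engineer a problem statement from observed input-output behavior."""
    inputs = generate_test_inputs(llm, solution_code)
    io_pairs = [(i, run_in_sandbox(sandbox, solution_code, i)) for i in inputs]
    # The LLM writes a description consistent with the observed I/O pairs.
    description = llm.complete(
        "Write a programming problem whose correct solution produces these "
        f"input/output pairs:\n{io_pairs}"
    )
    return Problem(description, solution_code, io_pairs)
```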
The contribution matters because it automates high-quality benchmark construction at scale, improving test-case diversity and coverage over prior approaches (e.g., KodCode, CodeI/O) and enabling robust multilingual evaluation. Key findings: performance gaps across models are small on popular languages but widen sharply on low-resource languages; LLMs struggle on multi-logic programming tasks; and performance on the benchmark scales with both model parameter count and test-time sampling. The multilingual sandbox's execution feedback also lets models iteratively refine their code. AutoCodeGen and AutoCodeBench thus provide a scalable toolkit for stress-testing and guiding the development of code-capable LLMs, especially where language diversity and complex logic are critical.
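The sandbox feedback loop mentioned above might look roughly like the following sketch, again assuming hypothetical `llm.complete` and `sandbox.run` interfaces: the model proposes a solution, the sandbox reports compile/runtime/test failures, and that feedback is fed back to the model for another attempt until the retry budget runs out.

```python
# Hypothetical sketch of iterative refinement via sandbox feedback.
def refine_with_sandbox(llm, sandbox, problem: str, language: str,
                        max_rounds: int = 3) -> str | None:
    """Iteratively repair a model's solution using execution feedback."""
    code = llm.complete(f"Solve the following problem in {language}:\n{problem}")
    for _ in range(max_rounds):
        result = sandbox.run(code=code, language=language)
        if result.passed:            # all hidden tests succeeded
            return code
        # Feed compiler/runtime/test feedback back to the model and retry.
        code = llm.complete(
            f"Your {language} solution failed with:\n{result.feedback}\n"
            f"Problem:\n{problem}\nPrevious attempt:\n{code}\nFix the code."
        )
    return None                      # unresolved after the retry budget
```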