AutoCodeBench: Large Language Models Are Automatic Code Benchmark Generators (github.com)

🤖 AI Summary
Researchers introduced AutoCodeGen and AutoCodeBench, an automated workflow and large-scale benchmark that turn LLMs into automatic code-benchmark generators. AutoCodeGen uses an LLM–sandbox interaction loop in which models generate problems and test inputs, and the corresponding outputs are produced inside a secure execution sandbox. The result, AutoCodeBench, contains 3,920 curated problems across 20 programming languages with a balanced category distribution and higher difficulty than prior multilingual sets. From extensive evaluation of over 30 open- and closed-source models, the team derived AutoCodeBench-Lite (1,586 problems solved by at least two models) and AutoCodeBench-Complete (1,000 problems formatted for 3-shot completion-style evaluation). A MultiLanguageSandbox service (Docker image provided) executes and validates solutions across 30+ languages and supports high-concurrency evaluation.

Technically, each dataset entry includes a problem statement, a canonical solution, a public demo_test_func with basic cases, and a private full_test_func for robust evaluation; execution results can be recorded in an "output" field for automated scoring. The repo demonstrates end-to-end evaluation with call_sandbox.py and vLLM examples, and the datasets are hosted on HuggingFace for download. By automating dataset generation and providing balanced, high-difficulty multilingual tasks, AutoCodeBench enables more realistic, scalable assessment of code-generation capabilities and helps uncover language- and difficulty-specific weaknesses that earlier Python-centric benchmarks missed.
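To make the entry schema concrete, here is a minimal Python sketch of how a candidate solution might be checked against the public and private test functions locally. The field names (problem, canonical_solution, demo_test_func, full_test_func) come from the summary above; the toy entry, the exact schema, and the scoring helper are assumptions for illustration, and the real repo runs this step through call_sandbox.py and the Dockerized MultiLanguageSandbox across 20 languages rather than a bare exec in Python.

```python
# Toy AutoCodeBench-style entry (hypothetical contents, Python-only).
entry = {
    "problem": "Write add(a, b) that returns the sum of two integers.",
    "canonical_solution": "def add(a, b):\n    return a + b\n",
    # Public tests shipped with the problem (basic cases).
    "demo_test_func": "def demo_test():\n    assert add(1, 2) == 3\n",
    # Private tests used for robust evaluation.
    "full_test_func": (
        "def full_test():\n"
        "    assert add(0, 0) == 0\n"
        "    assert add(-5, 5) == 0\n"
        "    assert add(10**6, 1) == 10**6 + 1\n"
    ),
}

def run_tests(solution_code: str, test_code: str, test_name: str) -> bool:
    """Load a candidate solution, define a test function, and run it."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)   # define the candidate solution
        exec(test_code, namespace)       # define the test function in the same scope
        namespace[test_name]()           # run the assertions against the solution
        return True
    except Exception:
        return False

candidate = entry["canonical_solution"]  # stand-in for an LLM completion
print("demo:", run_tests(candidate, entry["demo_test_func"], "demo_test"))
print("full:", run_tests(candidate, entry["full_test_func"], "full_test"))
```

In the actual workflow the sandbox provides the isolation, multi-language runtimes, and concurrency that this single-process sketch omits; the pass/fail results it returns play the role of the booleans printed here.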