SWE-Bench Pro (github.com)

🤖 AI Summary
Scale AI released SWE-Bench Pro, a harder benchmark with accompanying code and data for evaluating LLMs and agent systems on long-horizon software engineering tasks. Each instance provides a real codebase plus an issue description and requires the model to produce a working patch; the suite is inspired by SWE-Bench and is designed to stress reasoning, multi-step code edits, and environment-aware fixes rather than single-line completions. The dataset is available via Hugging Face (load_dataset('ScaleAI/SWE-bench_Pro', split='test')), and the repo provides an evaluation harness that runs reproducibly in Docker.

Technically, SWE-Bench Pro uses prebuilt Docker images (hosted under jefzda/sweap-images:{repo_base}.{repo_name}__{repo_base}-{repo_name}-{hash}) and Modal-based orchestration to scale evaluations. To run evaluations you install modal, run modal setup to store credentials (~/.modal.toml), generate patch predictions with your preferred harness, then call the provided evaluator (sweap_pro_eval_modal.py), pointing it at the raw CSV (external_hf_v2.csv) and your patch JSON (e.g., gold_patches.json). The evaluator supports parallel workers (e.g., --num_workers=100) and requires your Docker Hub username. SWE-Bench Pro therefore emphasizes reproducible, environment-aware benchmarking for code-generating systems and provides an infrastructure-ready workflow for stress-testing real-world repair capabilities.
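
As a minimal sketch, the dataset can be pulled with the Hugging Face datasets library exactly as quoted above; the field inspection lines are illustrative, since the summary does not enumerate the instance schema:

```python
from datasets import load_dataset

# Load the SWE-Bench Pro test split referenced in the summary.
ds = load_dataset('ScaleAI/SWE-bench_Pro', split='test')

print(len(ds))          # number of task instances
print(ds.column_names)  # inspect available fields before wiring up a harness
```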
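
The image naming scheme can be expanded per instance along these lines; the repo and hash values below are placeholders, not real SWE-Bench Pro instances:

```python
# Sketch: build the prebuilt image tag following the naming scheme above.
# All three values are hypothetical examples.
repo_base = "example-org"
repo_name = "example-repo"
instance_hash = "abc123"

image = (
    f"jefzda/sweap-images:{repo_base}.{repo_name}"
    f"__{repo_base}-{repo_name}-{instance_hash}"
)
print(image)  # pass to `docker pull` to fetch the prebuilt environment
```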
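
A hedged end-to-end sketch of the workflow as a Python wrapper around those CLI steps is below; only --num_workers comes from the summary, while the flag names for the CSV, patch JSON, and Docker Hub username are assumptions to be checked against the evaluator's own help output:

```python
import subprocess

# One-time Modal setup: stores credentials in ~/.modal.toml (interactive).
subprocess.run(["pip", "install", "modal"], check=True)
subprocess.run(["modal", "setup"], check=True)

# Run the provided evaluator against your predictions. Flag names other than
# --num_workers are hypothetical placeholders; consult sweap_pro_eval_modal.py.
subprocess.run(
    [
        "python", "sweap_pro_eval_modal.py",
        "--dataset_csv", "external_hf_v2.csv",          # raw CSV (assumed flag name)
        "--predictions", "gold_patches.json",           # patch JSON (assumed flag name)
        "--dockerhub_username", "your-dockerhub-user",  # assumed flag name
        "--num_workers=100",                            # parallel workers (from the summary)
    ],
    check=True,
)
```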