How We Made SWE-Bench 50x Smaller (logicstar.ai)

🤖 AI Summary
SWE-Bench Verified’s 500 containerized environments were previously distributed as hundreds of Docker images totaling 240 GiB uncompressed (over 100 GiB compressed) and took hours to set up, or roughly 30 hours under Docker Hub rate limits. They have now been reworked into a single 5 GiB compressed archive (31 GiB uncompressed). With a combination of techniques, the benchmark now downloads and unpacks in minutes (decompression takes ~40 s on one core; the parallel compression run took ~10 minutes on 100 cores). The result: large-scale evaluation and trace generation on ephemeral cloud machines becomes fast and practical.

Key technical moves: “delta layering” chains instance layers chronologically per repository, so each instance adds only the diff from the previous commit rather than a full repo copy, avoiding duplication across 63 environment layers and 500 instance layers (Django was split into two chains to respect Docker’s 125-layer limit). The team also restructured git packfiles to produce one packfile per instance (trading some git compression for tiny incremental layers), removed unnecessary build artifacts (installers, pip/conda caches), and applied cross-layer zstd compression with layers sorted by chain order to maximize redundancy, as sketched below. The approach is broadly applicable to other execution-environment benchmarks; helper scripts are provided and the archive is hosted on Hugging Face.
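To make the cross-layer compression step concrete, here is a minimal Python sketch, not the authors’ tooling: it assumes one tarball per layer already exists on disk and a hypothetical `<repo>__<index>.tar` naming scheme that encodes the chain order, then concatenates the layers in that order into a single zstd stream so that near-identical neighbouring layers compress against each other instead of being compressed independently.

```python
# Sketch of "cross-layer zstd": pack per-layer tarballs, sorted by chain order,
# into one tar-of-tars and compress it as a single zstd stream. The directory
# layout, the chain_order() naming convention, and the compression level are
# illustrative assumptions, not the benchmark's actual scripts.
import tarfile
from pathlib import Path

import zstandard as zstd  # pip install zstandard

LAYER_DIR = Path("layers")                  # assumed: one .tar per layer
ARCHIVE = Path("swebench_layers.tar.zst")   # assumed output name


def chain_order(layer_tar: Path) -> tuple:
    """Sort key: group layers by repository, then by position in the delta chain.
    Assumes filenames like '<repo>__<index>.tar'; adapt to your own naming."""
    repo, _, index = layer_tar.stem.rpartition("__")
    return (repo, int(index))


def compress_layers() -> None:
    layers = sorted(LAYER_DIR.glob("*.tar"), key=chain_order)
    cctx = zstd.ZstdCompressor(level=19, threads=-1)  # multithreaded zstd
    with ARCHIVE.open("wb") as raw, cctx.stream_writer(raw) as zout:
        # A streaming tar-of-tars keeps layer boundaries intact while letting
        # zstd see all layers as one stream, so redundancy between adjacent
        # layers in the chain is actually exploited.
        with tarfile.open(fileobj=zout, mode="w|") as outer:
            for layer in layers:
                outer.add(layer, arcname=layer.name)


if __name__ == "__main__":
    compress_layers()
```

The key design point the sketch illustrates is the sort order: compressing layers independently (as registries do) cannot deduplicate across layers, whereas one long stream ordered by chain position puts the most similar data next to each other within the compressor’s window.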