🤖 AI Summary
SWE-efficiency (pronounced “swee-FISH-uhn-see”) is a new benchmark that tests whether language models can perform real-world performance engineering: given a complete Python codebase and an actual workload, agents must locate bottlenecks, identify the relevant tests, and produce patches that preserve correctness while matching or exceeding expert speedups. The suite contains 498 tasks drawn from nine popular Python repositories (including numpy, pandas, and scipy) and evaluates solutions with a Speedup Ratio (SR) metric, SR = (LM speedup) / (expert speedup), where SR = 1.0× means the model’s patch matches the human expert’s improvement.
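As a rough illustration of the metric (not the benchmark’s actual harness), the sketch below computes SR from measured wall-clock runtimes; the function names and numbers are hypothetical:

```python
# Illustrative sketch of the Speedup Ratio; names and timings are made up.

def speedup(baseline_seconds: float, patched_seconds: float) -> float:
    """Speedup of a patch relative to the unmodified workload."""
    return baseline_seconds / patched_seconds

def speedup_ratio(lm_speedup: float, expert_speedup: float) -> float:
    """SR = (LM speedup) / (expert speedup); SR = 1.0 matches the expert."""
    return lm_speedup / expert_speedup

# Example: the workload takes 10.0 s unpatched; the LM's patch brings it to
# 8.0 s (1.25x), while the expert's patch brings it to 2.5 s (4.0x).
sr = speedup_ratio(speedup(10.0, 8.0), speedup(10.0, 2.5))
print(f"SR = {sr:.2f}x")  # prints "SR = 0.31x", well below the expert
```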
The results reveal a large gap between current LMs and human experts: state-of-the-art agents average under 0.15× the expert speedup. Models struggle to localize optimization opportunities, to reason about execution and data flow across functions and files, and to produce edits that both improve runtime and pass the existing unit tests. For the AI/ML community, this highlights the limitations of current code-generation and program-repair techniques on performance tasks and underscores the need for integrated dynamic analysis (profiling), better cross-file semantic reasoning, and test-driven synthesis. SWE-efficiency provides a realistic, reproducible benchmark to drive those advances and measure progress toward production-ready performance-optimization agents.
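To make the profiling point concrete, here is a minimal sketch (not from the paper) of how an agent might localize hotspots before editing, using Python’s built-in cProfile; `run_workload()` is a hypothetical stand-in for a repository’s real workload script:

```python
# Minimal profiling-guided localization sketch; run_workload() is hypothetical.
import cProfile
import pstats

def run_workload():
    # Placeholder standing in for the repository's actual workload.
    sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
run_workload()
profiler.disable()

# Rank functions by cumulative time to surface candidate bottlenecks,
# which the agent would then optimize and re-verify against existing tests.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```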