Code Review Benchmark (blog.macroscope.com)

🤖 AI Summary
Macroscope announced results from an internal benchmark comparing AI code-review tools on a curated dataset of 118 real-world, self-contained runtime bugs drawn from 45 popular open-source repositories spanning eight languages (Go, Java, JavaScript, Kotlin, Python, Rust, Swift, TypeScript). The team used LLMs to classify commits, generate human-readable bug descriptions, and help identify bug-introducing commits via git blame, then created PRs simulating the pre-bug and buggy states and ran five tools (Macroscope, CodeRabbit, Cursor Bugbot, Greptile, Graphite Diamond) with default settings. An LLM matched each tool's review comments against the known-bug descriptions, and matches were manually spot-checked.

Macroscope led overall detection at 48%, followed by CodeRabbit at 46%, Cursor Bugbot at 42%, Greptile at 24%, and Graphite Diamond at 18%. By language, Macroscope excelled in Go (86%), Java (56%), Python (50%), and Swift (36%), while CodeRabbit led in JavaScript (59%) and Rust (45%).

The benchmark is significant because it measures practical bug-detection performance on real commits rather than synthetic tests, and it contrasts detection rate with comment volume (CodeRabbit was the "loudest," Graphite Diamond the quietest, Macroscope mid-tier). Key caveats: only self-contained runtime bugs were evaluated, tools ran with default settings on minimum plans (no custom rules), sample sizes varied (Greptile was partially disabled), and LLM-assisted labeling introduces potential bias. Practitioners should view these results as actionable but not definitive; tool choice still depends on target languages, noise tolerance, and customization needs.
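To make the "identify introducing commits via git blame" step concrete, here is a minimal Python sketch of how a fix commit can be traced back to candidate bug-introducing commits. It assumes a local clone, a fixing commit hash, and the file path and line range touched by the fix (inputs the summary says were derived with LLM assistance); the function names and the repo/path values are illustrative, not Macroscope's actual tooling.

```python
"""Illustrative sketch (not Macroscope's actual pipeline): given a commit that
fixes a bug, blame the pre-fix version of the touched lines to guess which
earlier commit(s) introduced them."""

import subprocess


def run_git(repo: str, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout."""
    return subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    ).stdout


def blame_introducing_commits(repo: str, fix_commit: str, path: str,
                              start: int, end: int) -> set[str]:
    """Blame the buggy lines as they existed just before the fix landed.

    `fix_commit^` is the parent of the fixing commit, i.e. the last state in
    which the bug was still present.
    """
    out = run_git(repo, "blame", "--porcelain",
                  "-L", f"{start},{end}", f"{fix_commit}^", "--", path)
    hashes: set[str] = set()
    for line in out.splitlines():
        if line.startswith("\t"):
            continue  # file content lines, not blame metadata
        token = line.split(" ", 1)[0]
        # Hunk headers in porcelain output begin with a 40-character hash.
        if len(token) == 40 and all(c in "0123456789abcdef" for c in token):
            hashes.add(token)
    return hashes


if __name__ == "__main__":
    # Hypothetical example: the fix touched lines 10-20 of src/parser.go.
    candidates = blame_introducing_commits(
        repo="/tmp/some-oss-repo", fix_commit="abc1234",
        path="src/parser.go", start=10, end=20,
    )
    print("candidate bug-introducing commits:", candidates)
```

In practice a blame-based heuristic like this only narrows the search; the summary notes the candidates still had to be confirmed (here, with LLM help and manual spot-checks) before a commit was treated as the true bug introducer.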