🤖 AI Summary
Macroscope announced results from an internal benchmark comparing AI code-review tools on a curated dataset of 118 real-world, self-contained runtime bugs drawn from 45 popular open-source repositories spanning eight languages (Go, Java, JavaScript, Kotlin, Python, Rust, Swift, TypeScript). Using LLMs to classify commits, generate human-readable bug descriptions, and help identify bug-introducing commits via git blame, the team created PRs simulating the pre-bug and buggy states and ran five tools (Macroscope, CodeRabbit, Cursor Bugbot, Greptile, Graphite Diamond) with default settings. An LLM then matched each tool's review comments against the known-bug descriptions, and matches were manually spot-checked. Macroscope led overall detection at 48%, followed by CodeRabbit at 46%, Cursor Bugbot at 42%, Greptile at 24%, and Graphite Diamond at 18%. Language highlights: Macroscope excelled in Go (86%), Java (56%), Python (50%), and Swift (36%); CodeRabbit led in JavaScript (59%) and Rust (45%).
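To make the git-blame step concrete, here is a minimal sketch of how a bug-introducing commit might be located once the fixing commit and the buggy line are known. The repository path, file, line number, and commit SHA are hypothetical, and this illustrates the general technique rather than Macroscope's actual pipeline.

```python
import subprocess

def find_introducing_commit(repo: str, path: str, line: int, fix_commit: str) -> str:
    """Blame the buggy line as it existed just before the fix commit,
    returning the SHA of the commit that last touched it."""
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--porcelain",
         "-L", f"{line},{line}", f"{fix_commit}^", "--", path],
        capture_output=True, text=True, check=True,
    )
    # In porcelain format, the first token of the first output line is the commit SHA.
    return out.stdout.split()[0]

# Hypothetical usage: blame line 120 of src/parser.go as of the parent of the
# fixing commit to recover the commit that introduced the bug.
sha = find_introducing_commit("repos/example", "src/parser.go", 120, "abc1234")
print(sha)
```

Blaming at `fix_commit^` (the parent of the fix) rather than at HEAD is what makes this work: the buggy line still exists at that revision, so blame attributes it to whichever commit introduced it.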
The benchmark is significant because it measures practical bug-detection performance on real commits rather than synthetic tests, and because it contrasts detection rate with comment volume (CodeRabbit was the “loudest,” Graphite Diamond the quietest, and Macroscope mid-tier). Key caveats: only self-contained runtime bugs were evaluated, tools ran with default configurations on minimum-tier plans (no custom rules), sample sizes varied (Greptile was partially disabled), and LLM-assisted labeling introduces potential bias. Practitioners should treat these results as actionable but not definitive; the right tool still depends on target languages, noise tolerance, and customization needs.
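The detection-rate versus comment-volume tradeoff can be captured with two simple per-tool metrics. The sketch below shows one way to compute them; the `PRResult` structure and sample numbers are illustrative, not the benchmark's actual records.

```python
from dataclasses import dataclass

@dataclass
class PRResult:
    bug_detected: bool   # did any review comment match the known bug?
    comment_count: int   # total review comments the tool left on the PR

def summarize(results: list[PRResult]) -> tuple[float, float]:
    """Return (detection rate, mean comments per PR) for one tool."""
    detection = sum(r.bug_detected for r in results) / len(results)
    noise = sum(r.comment_count for r in results) / len(results)
    return detection, noise

# Illustrative data: a tool that catches half the bugs while averaging
# a handful of comments per PR.
sample = [PRResult(True, 4), PRResult(False, 6), PRResult(True, 3), PRResult(False, 5)]
rate, volume = summarize(sample)
print(f"detection rate: {rate:.0%}, avg comments/PR: {volume:.1f}")
```

A tool with a high detection rate but a high comments-per-PR average may still cost more reviewer attention than a quieter tool with a slightly lower rate, which is why the benchmark reports both dimensions.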