Claude Code Daily Benchmarks for Degradation Tracking (marginlab.ai)

0 points 144 days ago | visit original

🤖 AI Summary

A new performance tracker for Claude Code monitors its Opus 4.5 model for degradation on software engineering (SWE) tasks. It runs daily benchmarks on a curated subset of SWE-Bench-Pro and uses statistical tests to detect significant drops in performance. Evaluations run directly in the Claude Code command-line interface (CLI), without custom setups, so the results reflect what users actually experience. The baseline pass rate is 58%, and the thresholds for statistical significance are ±14.0% for daily results and ±5.6% for weekly results. Each result is modeled as a Bernoulli random variable, and 95% confidence intervals are computed around the measured pass rate. The tracker is presented as a transparent, independent resource for the AI/ML community to identify performance variation over time, following concerns raised in September 2025 about model degradation. Its stated aim is to give developers and researchers timely, reliable data on model reliability and to help set performance expectations for AI tools.
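The Bernoulli modeling described above can be illustrated with a minimal sketch. It assumes a normal-approximation (Wald) interval, which the summary does not confirm as the tracker's actual method, and the sample size of 50 tasks per day is a hypothetical value chosen for illustration. The function names are also illustrative and are not taken from the tracker.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def significance_threshold(baseline: float, trials: int, z: float = Z_95) -> float:
    """Half-width of the normal-approximation interval around a baseline pass rate.

    Each task is treated as a Bernoulli trial with success probability
    `baseline`, so the pass rate over `trials` tasks has standard error
    sqrt(p * (1 - p) / n).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 <= baseline <= 1.0:
        raise ValueError("baseline must be in [0, 1]")
    return z * math.sqrt(baseline * (1.0 - baseline) / trials)


def is_significant_change(passes: int, trials: int, baseline: float = 0.58) -> bool:
    """True when the observed pass rate falls outside baseline +/- threshold."""
    observed = passes / trials
    return abs(observed - baseline) > significance_threshold(baseline, trials)


if __name__ == "__main__":
    # Hypothetical daily run: 50 tasks (assumed, not stated in the summary).
    daily_tasks = 50
    print(f"threshold at n={daily_tasks}: {significance_threshold(0.58, daily_tasks):.3f}")
    print("20/50 passes significant?", is_significant_change(20, daily_tasks))
    print("27/50 passes significant?", is_significant_change(27, daily_tasks))
```

Because the threshold shrinks with the square root of the sample size, pooling a week of runs gives a tighter band than a single day, which is consistent with the weekly threshold being narrower than the daily one.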
