Choosing the best AI coding agent for Bitrise (bitrise.io)

🤖 AI Summary
Bitrise formed a tiger team to evaluate AI coding agents and built an internal, Go-based evaluation framework to benchmark LLMs and agents end-to-end under production-like conditions. The framework declaratively lists agents and test cases, spins up Docker containers, installs the agents, clones repositories, applies patches, runs programmatic checks (e.g., go test ./...), and uses LLM judges to assess outputs. Tests run in parallel (a typical run takes about 10 minutes), feed results into a SQL database and Metabase dashboards, and catch regressions automatically with minimal manual overhead, enabling fast, statistically meaningful iteration on non-deterministic systems.

After testing Claude Code, Codex, Gemini, and the open-source OpenCode, Bitrise concluded it could match Claude Code's performance while avoiding vendor lock-in by building an in-house agent layered on Anthropic's APIs. Claude Code stood out for its MCP support, session persistence (~/.claude), and strong benchmark results, but its closed-source constraints and API lock-in were dealbreakers. The other options had their own trade-offs: Codex wandered in multi-step reasoning, Gemini had unpredictable latency, and OpenCode was flexible but slower and tightly coupled to its TUI.

The in-house design uses dynamically constructed Go sub-agents, programmatic checkpoints, centralized logging, and provider-agnostic LLM message storage, giving tighter integration, model switching mid-conversation, and safer, verifiable AI features for production CI/CD workflows.
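To make the evaluation-framework description concrete, here is a minimal Go sketch of a declarative test case run inside a fresh Docker container with a programmatic check. It is an illustration only: the TestCase fields, the run-agent command, and the eval-base:latest image are assumptions for the example, not Bitrise's actual code.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"sync"
	"time"
)

// TestCase is a hypothetical declarative description of one evaluation:
// which agent to install, which repo to prepare, and which programmatic
// check decides pass/fail once the agent has finished.
type TestCase struct {
	Name   string
	Agent  string // e.g. "claude-code", "opencode"
	Repo   string
	Patch  string // patch applied to the clone before the agent runs
	Prompt string
	Check  string // e.g. "go test ./..."
}

// runInContainer executes one test case inside a fresh Docker container,
// so every run starts from a clean, production-like environment.
func runInContainer(ctx context.Context, tc TestCase) error {
	// Clone, patch, run the agent, then run the programmatic check.
	// `run-agent` and the `eval-base:latest` image are placeholders.
	script := fmt.Sprintf(
		"git clone %s /work && cd /work && git apply %s && run-agent %s %q && %s",
		tc.Repo, tc.Patch, tc.Agent, tc.Prompt, tc.Check,
	)
	cmd := exec.CommandContext(ctx, "docker", "run", "--rm", "eval-base:latest", "sh", "-c", script)
	out, err := cmd.CombinedOutput()
	fmt.Printf("[%s]\n%s", tc.Name, out)
	return err
}

func main() {
	cases := []TestCase{
		{
			Name:   "fix-failing-handler-test",
			Agent:  "claude-code",
			Repo:   "https://example.com/sample-service.git",
			Patch:  "break-handler.patch",
			Prompt: "The handler tests fail; fix the bug without changing the tests.",
			Check:  "go test ./...",
		},
	}

	// Cap the whole suite; running cases in parallel keeps wall-clock time low.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
	defer cancel()

	var wg sync.WaitGroup
	for _, tc := range cases {
		wg.Add(1)
		go func(tc TestCase) {
			defer wg.Done()
			if err := runInContainer(ctx, tc); err != nil {
				fmt.Printf("[%s] FAILED: %v\n", tc.Name, err)
			}
		}(tc)
	}
	wg.Wait()
}
```

In a real harness the container output would also be handed to an LLM judge and the pass/fail result written to the database backing the Metabase dashboards; those steps are omitted here for brevity.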
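The provider-agnostic message storage and mid-conversation model switching mentioned for the in-house agent can be illustrated with a small interface-based design. This is a sketch under the assumption that conversation history is kept in a neutral Message type; the Provider interface and the stub clients are invented for the example and stand in for real API clients.

```go
package main

import (
	"context"
	"fmt"
)

// Message is a provider-agnostic record of one turn in a conversation.
// Storing history in this neutral shape (rather than a vendor SDK type)
// is what allows switching providers mid-conversation.
type Message struct {
	Role    string // "user", "assistant", "tool"
	Content string
}

// Provider abstracts any LLM backend (Anthropic, OpenAI, Gemini, ...).
type Provider interface {
	Name() string
	Complete(ctx context.Context, history []Message) (Message, error)
}

// anthropicStub and geminiStub are placeholders standing in for real clients.
type anthropicStub struct{}

func (anthropicStub) Name() string { return "anthropic" }
func (anthropicStub) Complete(_ context.Context, h []Message) (Message, error) {
	return Message{Role: "assistant", Content: "(anthropic reply to: " + h[len(h)-1].Content + ")"}, nil
}

type geminiStub struct{}

func (geminiStub) Name() string { return "gemini" }
func (geminiStub) Complete(_ context.Context, h []Message) (Message, error) {
	return Message{Role: "assistant", Content: "(gemini reply to: " + h[len(h)-1].Content + ")"}, nil
}

func main() {
	history := []Message{{Role: "user", Content: "Summarize the failing test."}}

	// Because history is provider-neutral, the agent can swap providers
	// between turns without translating the stored conversation.
	for _, p := range []Provider{anthropicStub{}, geminiStub{}} {
		reply, err := p.Complete(context.Background(), history)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %s\n", p.Name(), reply.Content)
		history = append(history, reply,
			Message{Role: "user", Content: "Now propose a fix."})
	}
}
```

The same neutral history would also be what gets written to centralized logging, which is one reason a provider-agnostic format pays off beyond model switching.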