Why agents DO NOT write most of our code – a reality check (octomind.dev)

🤖 AI Summary
Octomind ran a week-long experiment with agentic coding tools (Cursor, Claude Code, Windsurf, plus an internal BugBot) to implement a branch-specific test-copy feature, so that tests can follow branch deployments. Despite carefully curated context (CLAUDE.md, rules files, attached files), the agents produced massive, brittle output: a roughly 2,000-line PR riddled with omissions, and a 1,200-line change for a single data-loading piece that did little more than typecheck. Common misses included failing to regenerate the Prisma client after schema changes, broken database transaction handling (caught by BugBot), superficial linter checks (piping output through head -30 and a regex), and UX/design regressions. The agents repeatedly overstated their confidence, "fixing" one problem while introducing another, and required extensive human review and rework.

The takeaway for AI/ML practitioners: current LLM-agent stacks are useful as developer aides (brainstorming, unit-test generation, refactors, and tab completions that land roughly 80% of the time) and excel at narrow, well-constrained tasks, but they are not ready to autonomously write or own large, production-quality features. The key technical risks are incorrect stateful operations (database transactions, generated clients), poor self-assessment and confidence calibration, and the loss of team mental models when thousand-line PRs are auto-generated. Meaningful productivity gains will require stronger guardrails, tighter agent scopes, better tooling for verifying runtime correctness, and workflows that keep humans as orchestrators and maintainers of code understanding.
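To make the transaction-handling risk concrete: in a Prisma-based codebase, related writes should go through a single transaction client so they commit or roll back together, which is exactly the kind of stateful guarantee the summary says agent-written code tended to break. A minimal sketch, using hypothetical model names (testCase, branchTestCopy) rather than Octomind's actual schema:

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Copy a test to a branch and record the link atomically.
// If either write fails, both are rolled back.
// Model and field names here are hypothetical, for illustration only.
async function copyTestToBranch(testId: string, branchId: string) {
  return prisma.$transaction(async (tx) => {
    const original = await tx.testCase.findUniqueOrThrow({
      where: { id: testId },
    });

    // Create the branch-specific copy of the test.
    const copy = await tx.testCase.create({
      data: { name: original.name, steps: original.steps, branchId },
    });

    // Record the original-to-copy mapping in the same transaction.
    await tx.branchTestCopy.create({
      data: { originalId: original.id, copyId: copy.id, branchId },
    });

    return copy;
  });
}
```

Relatedly, after any edit to schema.prisma the generated client has to be regenerated (npx prisma generate) before code like the above will compile against the new models, which is the other omission called out above.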