AI-generated tests are lying to you (davidadamojr.com)

🤖 AI Summary
AI tools like GitHub Copilot, Cursor, and Claude Code have made it trivial to auto-generate unit tests, and teams are happily shipping green checkmarks and high coverage reports. But a subtle trap has emerged: LLM-generated tests typically read the implementation and synthesize expectations that match the current code, not the intended behavior. That produces “self-fulfilling” tests that validate bugs (e.g., a divide-by-zero case where the code returns 0 and the test asserts that same behavior) and gives teams false confidence. The result is a broken feedback loop: coverage and speed improve while correctness and intent do not.

There are legitimate uses: characterization tests (per Michael Feathers) are useful for freezing legacy behavior. But most teams misuse AI to replace thinking for new code. Practical mitigations:

- Prompt models with requirements before coding (TDD-style).
- Ask for failure modes and edge-case brainstorming rather than success paths.
- Use AI to expand boundary/fuzz inputs.
- Measure test strength with mutation testing tools like MutPy or PIT so generated tests must “kill” injected bugs.

Bottom line: LLMs are excellent at producing artifacts but not at judging correctness; use them to amplify test design and intent, not to automate judgment.
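To make the failure mode concrete, here is a minimal Python sketch of the “self-fulfilling test” pattern described above. The `safe_divide` helper and both test names are hypothetical, invented for illustration; the article itself does not show code.

```python
import pytest


def safe_divide(numerator: float, denominator: float) -> float:
    """Hypothetical helper with a bug: dividing by zero silently returns 0."""
    if denominator == 0:
        return 0  # Bug: the intent is to signal an error, not to hide it
    return numerator / denominator


def test_divide_by_zero_generated():
    # A test synthesized by reading the implementation simply mirrors the code,
    # so it passes while locking in the buggy behavior.
    assert safe_divide(10, 0) == 0


def test_divide_by_zero_from_requirement():
    # A test written from the stated requirement ("division by zero is an error")
    # fails against the current implementation and exposes the bug.
    with pytest.raises(ZeroDivisionError):
        safe_divide(10, 0)
```

Run under pytest, the first test passes and quietly freezes the bug into the suite; the second, written from intent rather than from the code, fails against this implementation and surfaces it.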