🤖 AI Summary
Three leading LLMs (OpenAI’s GPT-5.1, Google’s Gemini 3.0, and Anthropic’s Opus 4.5) were benchmarked on three coding tasks: strict prompt adherence, a large-scale TypeScript refactor, and system understanding plus extension. The goal was to evaluate spec fidelity, security awareness, and real-world extension capability.

The prompt-adherence test imposed rigid requirements (exact class and method names, time.monotonic(), threading.Lock(), etc.). Gemini 3.0 followed the spec most literally and scored highest on adherence; GPT-5.1 added defensive input validation and extra checks beyond the spec; Opus 4.5 landed in the middle, with better docstrings but a minor naming mismatch.

In the TypeScript refactor, Opus 4.5 delivered the most complete fix (100/100), implementing the requested rate limiting, environment-variable secrets, security headers, and layered architecture; GPT-5.1 fixed 9 of 10 requirements and additionally added transactions and backward-compatible field handling; Gemini missed some of the deeper security fixes. A sketch of the kind of changes this task asked for appears below.

For the notification extension, Opus produced the most exhaustive email support (templates for 7 event types plus runtime template management), GPT-5.1 produced a detailed architectural audit (diagrams, line-cited bugs) and full-featured email handlers (CC/BCC, attachments), and Gemini delivered a minimal but functional implementation; a hedged sketch of that handler shape also follows.
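To make the refactor requirements concrete, here is a minimal TypeScript sketch of the security hardening described above. The benchmark’s actual codebase is not shown in this summary, so an Express app, the helmet and express-rate-limit packages, the /users/:id route, and the API_KEY variable name are all assumptions made for illustration.

```typescript
// Hypothetical sketch of the requested fixes: env-var secrets, security
// headers, rate limiting, and a layered (route -> service) structure.
import express from "express";
import helmet from "helmet";
import rateLimit from "express-rate-limit";

// Secret read from the environment instead of being hard-coded in source.
const apiKey = process.env.API_KEY;
if (!apiKey) {
  throw new Error("API_KEY must be set in the environment");
}

// Layered architecture: routes delegate business logic to a service layer
// (a stub stands in for the real service here).
const userService = {
  async findById(id: string) {
    return { id, name: "example user" };
  },
};

const app = express();

// Standard security headers (CSP, X-Content-Type-Options, etc.).
app.use(helmet());

// Basic rate limiting: at most 100 requests per IP per 15-minute window.
app.use(rateLimit({ windowMs: 15 * 60 * 1000, max: 100 }));

// Thin route handler: HTTP concerns only, logic lives in the service.
app.get("/users/:id", async (req, res) => {
  const user = await userService.findById(req.params.id);
  res.json(user);
});

app.listen(3000);
```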
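Similarly, the notification-extension task roughly amounts to adding an email channel like the sketch below. The real system’s event names, template count (7 in the benchmark), and transport settings are not shown here; nodemailer, the SMTP_* environment variables, and the three sample events are assumptions.

```typescript
// Hypothetical email notification handler with per-event templates,
// CC/BCC, and attachment support.
import nodemailer from "nodemailer";

type NotificationEvent = "order_created" | "order_shipped" | "password_reset";

interface EmailNotification {
  event: NotificationEvent;
  to: string;
  cc?: string[];   // optional CC recipients
  bcc?: string[];  // optional BCC recipients
  data: Record<string, string>;
  attachments?: { filename: string; content: Buffer }[];
}

// Runtime-managed template map: one subject and body renderer per event.
const templates = new Map<
  NotificationEvent,
  { subject: string; body: (d: Record<string, string>) => string }
>([
  ["order_created", { subject: "Order received", body: d => `Thanks! Order ${d.orderId} was created.` }],
  ["order_shipped", { subject: "Order shipped", body: d => `Order ${d.orderId} is on its way.` }],
  ["password_reset", { subject: "Password reset", body: d => `Reset your password: ${d.link}` }],
]);

// SMTP credentials come from the environment; real code should validate them.
const transporter = nodemailer.createTransport({
  host: process.env.SMTP_HOST,
  port: Number(process.env.SMTP_PORT ?? 587),
  auth: { user: process.env.SMTP_USER ?? "", pass: process.env.SMTP_PASS ?? "" },
});

export async function sendNotification(n: EmailNotification): Promise<void> {
  const template = templates.get(n.event);
  if (!template) throw new Error(`No template registered for event: ${n.event}`);

  await transporter.sendMail({
    from: process.env.SMTP_FROM,
    to: n.to,
    cc: n.cc,
    bcc: n.bcc,
    subject: template.subject,
    html: template.body(n.data),
    attachments: n.attachments,
  });
}
```

Keeping templates in a mutable map is one simple way to support the runtime template management the summary credits to Opus; a production system would more likely persist templates in storage.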
Key takeaways for engineers: GPT-5.1 is “defensive and thorough” (more code, JSDoc, explicit types, catches subtle bugs), Gemini 3.0 is “minimal and fast/cheap” (shorter outputs, literal interpretation of the spec), and Opus 4.5 is “balanced and complete” (strict typing, custom error classes, the most features implemented). Opus 4.5 was the fastest and highest-scoring overall but costlier than Gemini ($1.68 vs. $1.10), and GPT-5.1 produces 1.5–1.8x more code than Gemini. Choose Gemini for concise, spec-exact output, GPT-5.1 when you want safety and backward compatibility baked in, and Opus 4.5 when you need a full, production-ready pass on the first run.