Is Opus 4.5 really 'the best model in the world for coding'? It just failed half my tests (www.zdnet.com)

🤖 AI Summary
Anthropic’s new Claude Opus 4.5, billed as “the best model in the world for coding,” was put through a four-test coding battery and scored 50%, passing two tests and failing two.

In Test 1 it generated a WordPress plugin across PHP, JavaScript, and CSS (312-line PHP, 178-line JS, 133-line CSS) but bungled file delivery: it combined the files into one unusable bundle, produced a broken download link, and embedded human-readable documentation directly into the JS without commenting it out, which would have broken execution. After manual extraction and cleanup the UI rendered, but core actions (Randomize, Clear) didn’t work.

Test 2 asked for a repaired JavaScript currency validator; the returned fix rejected valid inputs like "12.", ".5", and "000.5", crashed on null values, and mishandled excess precision ("12.345"), so it failed. (A sketch of a validator covering these edge cases follows below.)

Opus 4.5 passed Test 3 (a deep PHP/WordPress debugging task) and Test 4 (coordinating AppleScript, Chrome, and Keyboard Maestro), showing it can handle framework-specific reasoning and multi-tool scripting in some cases.

The takeaway for the AI/ML community: Opus 4.5 shows strong capabilities in complex, framework-aware analysis and agentic workflows, but it remains unreliable in a plain chatbot context for hands-off code generation and file handling. Practical implications include the need for careful validation, iterative prompting, and human supervision, especially when generated code will be executed or shipped. These failure modes (broken file outputs, injected commentary, brittle edge-case handling, and crashes on null inputs) highlight that “best model” claims still require scrutiny and task-specific benchmarking. Anthropic has been contacted for comment.
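For concreteness, here is a minimal sketch of a currency validator that handles the Test 2 edge cases the article calls out. The function name `validateCurrency`, the result-object return shape, and the rule that more than two decimal places is invalid are all assumptions for illustration; the article does not publish the original test’s spec, and it may have expected rounding rather than rejection for excess precision.

```javascript
// Hypothetical currency validator illustrating the Test 2 edge cases.
// Assumptions (not from the original test spec): inputs are strings;
// "12.", ".5", and "000.5" are valid; null/undefined are rejected
// without throwing; more than two fractional digits is invalid.
function validateCurrency(input) {
  // Guard against null/undefined/non-string values instead of crashing
  // (typeof null is "object", so this check covers null too).
  if (typeof input !== "string") {
    return { valid: false, reason: "input must be a string" };
  }

  const trimmed = input.trim();

  // Accept leading zeros, an optional trailing decimal point, and at
  // most two fractional digits: "12", "12.", "12.34", ".5", "000.5"
  // all match; "12.345" does not (three fractional digits).
  const pattern = /^(?:\d+\.?\d{0,2}|\.\d{1,2})$/;
  if (!pattern.test(trimmed)) {
    return { valid: false, reason: "not a valid currency amount" };
  }

  return { valid: true, value: Number(trimmed) };
}

// The edge cases called out in the article:
console.log(validateCurrency("12."));    // { valid: true, value: 12 }
console.log(validateCurrency(".5"));     // { valid: true, value: 0.5 }
console.log(validateCurrency("000.5"));  // { valid: true, value: 0.5 }
console.log(validateCurrency("12.345")); // { valid: false, ... }
console.log(validateCurrency(null));     // { valid: false, ... } — no crash
```

The key design choice is returning a result object rather than throwing, so a null input degrades gracefully instead of crashing the way the article says the generated code did.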