Results from Testing Six AI Models on Advanced Security Exploits (blog.kilocode.ai)

🤖 AI Summary
Researchers ran six leading LLMs (GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, and Gemini 2.5 Pro) against three advanced security exploits: a Node.js prototype-pollution privilege escalation (a deepMerge that folds user input into server objects, letting a __proto__ injection set req.user.isAdmin), a 2025-style agentic AI supply-chain attack (indirect prompt injection combined with over-privileged Azure tokens and unsafe WASM with filesystem access), and an ImageMagick OS command injection (user-controlled font/text reaching an unsanitized child_process.exec). Sketches of these patterns appear below.

All models detected every vulnerability, but the quality of fixes varied widely. GPT-5 most often produced the most complete, defense-in-depth responses (multi-layer mitigations, token isolation, output gating, spawn/execFile instead of shell exec), while the rest ranged from production-ready (o3, Claude Opus) to simpler patches that missed edge cases (Grok).

Significance: the case study shows modern LLMs can find classic bugs reliably, but frontier threats, such as agentic supply-chain attacks with novel chains of compromise, expose gaps where deeper reasoning matters. Top performers combined provenance checks, tool scoping, least-privilege short-lived tokens, HTML sanitization, and strict argument vectors/allowlists for ImageMagick. Cost matters too: the small-scale evaluation cost just $1.81 in total, and the authors recommend GPT-5 for mission-critical audits, o3 as a pragmatic near-top performer, and Gemini/o3 for bulk or budget scanning (90–95% of top quality at roughly 72–75% lower cost).
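To make the first exploit concrete, here is a minimal sketch of the prototype-pollution pattern, assuming a naive recursive deepMerge like the one the post describes; the function body, the attack payload, and the blocked-key fix are illustrative rather than the post's exact code.

```ts
// Minimal sketch of the prototype-pollution pattern; names are illustrative.
function deepMerge(target: any, source: any): any {
  for (const key of Object.keys(source)) {
    if (source[key] && typeof source[key] === "object") {
      // Reading target["__proto__"] returns Object.prototype, so the
      // recursion merges attacker-controlled data into the global prototype.
      target[key] = deepMerge(target[key] ?? {}, source[key]);
    } else {
      target[key] = source[key];
    }
  }
  return target;
}

// Attacker-supplied body: JSON.parse creates an own "__proto__" key.
deepMerge({}, JSON.parse('{"__proto__": {"isAdmin": true}}'));
console.log(({} as any).isAdmin); // true: every object now looks like an admin

// One standard fix: refuse the dangerous keys outright (maps or
// Object.create(null) targets are alternatives).
const BLOCKED = new Set(["__proto__", "constructor", "prototype"]);
function safeMerge(target: any, source: any): any {
  for (const key of Object.keys(source)) {
    if (BLOCKED.has(key)) continue; // drop pollution vectors
    if (source[key] && typeof source[key] === "object") {
      target[key] = safeMerge(
        Object.prototype.hasOwnProperty.call(target, key) ? target[key] : {},
        source[key],
      );
    } else {
      target[key] = source[key];
    }
  }
  return target;
}
```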
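The agentic supply-chain attack is harder to reduce to a snippet, but the provenance-check and tool-scoping mitigations the top models applied can be sketched generically; the types, tool names, and gate logic below are hypothetical, not from the post or any real agent framework.

```ts
// Hypothetical provenance gate for an agent runtime: instructions that
// arrive via retrieved content (the indirect-prompt-injection channel)
// may only invoke read-only tools, never token- or filesystem-bearing ones.
type Provenance = "user" | "retrieved-content";

interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
  provenance: Provenance; // where the instruction that triggered this came from
}

const READ_ONLY_TOOLS = new Set(["search", "summarize"]); // assumed names

function gateToolCall(call: ToolCall): void {
  if (call.provenance === "retrieved-content" && !READ_ONLY_TOOLS.has(call.tool)) {
    throw new Error(`"${call.tool}" blocked: untrusted instruction source`);
  }
}

// A call planted by poisoned content cannot reach the WASM sandbox or tokens:
try {
  gateToolCall({ tool: "run_wasm", args: {}, provenance: "retrieved-content" });
} catch (e) {
  console.error(String(e)); // '"run_wasm" blocked: untrusted instruction source'
}
```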
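For the ImageMagick exploit, the exec-versus-execFile contrast the summary mentions looks roughly like this; the convert invocation, allowlist entries, and text filter are assumptions standing in for the post's code.

```ts
import { exec, execFile } from "node:child_process";

// Vulnerable pattern: user-controlled font/text interpolated into a shell
// string, so input like `x"; rm -rf /tmp #` breaks out of the quoting.
function renderLabelUnsafe(font: string, text: string): void {
  exec(`convert -font ${font} label:"${text}" /tmp/out.png`);
}

// Defense-in-depth version: execFile passes an argument vector with no
// shell, the font must match an allowlist, and the text is filtered.
const FONT_ALLOWLIST = new Set(["DejaVu-Sans", "Liberation-Sans"]); // assumed
function renderLabelSafe(font: string, text: string): void {
  if (!FONT_ALLOWLIST.has(font)) throw new Error("font not allowed");
  if (!/^[\w .,!?-]{1,128}$/.test(text)) throw new Error("label text rejected");
  execFile("convert", ["-font", font, `label:${text}`, "/tmp/out.png"], (err) => {
    if (err) console.error("convert failed:", err);
  });
}
```

spawn with an argument array gives the same no-shell property when output needs to be streamed rather than buffered.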