Which LLM is the best at finding real vulnerabilities? (medium.com)

🤖 AI Summary
A cybersecurity instructor tested how well large language models (LLMs) identify vulnerabilities in a fake banking web application. Seven free LLMs were each tasked with detecting a predefined set of 13 critical vulnerabilities, ranging from cross-site scripting (XSS) to remote command execution. The models were graded on accuracy and on the quality of their audit reports, and were penalized for false positives. GPT-OSS was the top performer, correctly identifying 10 of the critical vulnerabilities, while the smaller model Gemma also performed well, suggesting that performance does not depend on model size alone. The exercise underscores the importance of precision in vulnerability reporting: many of the LLMs generated excessive noise, with some producing numerous duplicates or low-impact findings, which complicates the assessment process. The study argues for measuring the efficacy of AI in cybersecurity by actual exploitable vulnerabilities found rather than by the sheer volume of findings. This research could inform model training and the integration of AI tools into security auditing.
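The grading idea described above (credit for real findings, a penalty for false positives, and a report-quality component) can be sketched as follows. This is a minimal illustration only: the summary does not give the instructor's actual rubric, so the `score` function, the `fp_penalty` and `quality_weight` values, the vulnerability IDs, and the two example models are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    model: str              # hypothetical model label
    reported: set[str]      # deduplicated finding IDs the model reported
    report_quality: float   # 0.0-1.0, assumed to be assigned by a human grader


def score(result: AuditResult, known: set[str],
          fp_penalty: float = 0.5, quality_weight: float = 2.0) -> float:
    """Score one audit: +1 per known vulnerability found,
    minus a penalty per false positive, plus a report-quality bonus.
    The weights are assumptions, not the study's values."""
    true_positives = result.reported & known
    false_positives = result.reported - known
    return (len(true_positives)
            - fp_penalty * len(false_positives)
            + quality_weight * result.report_quality)


# Hypothetical ground truth: 13 predefined vulnerabilities, V1..V13.
KNOWN = {f"V{i}" for i in range(1, 14)}

# Two made-up models that find the same 10 real issues;
# the second also reports 6 findings that are not in the known set.
precise = AuditResult("model_a", {f"V{i}" for i in range(1, 11)}, 0.5)
noisy = AuditResult("model_b",
                    {f"V{i}" for i in range(1, 11)} | {f"X{i}" for i in range(1, 7)},
                    0.5)

precise_score = score(precise, KNOWN)
noisy_score = score(noisy, KNOWN)
```

Under these assumed weights, the two models find the same real vulnerabilities, but the noisy one scores lower, which is the behavior the summary attributes to the false-positive penalty.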