Snyk VulnBench JavaScript 1.0: Can LLMs Find the Same Bugs Twice? (arxiv.org)

0 points | 1 day ago

🤖 AI Summary

Snyk has released VulnBench JS 1.0, a benchmark that measures how reliably large language models (LLMs) identify vulnerabilities in JavaScript code. Across 300 scans, the LLMs' reference-matched findings were consistent, but their extra findings varied widely from run to run. Of 161 unique unmatched findings, many appeared in only one of five identical runs, which points to a lack of reproducibility in LLM security reviews. By contrast, Snyk Code, a static application security testing (SAST) tool, produced deterministic results, consistently identifying the same vulnerabilities across repeated scans and with greater accuracy. The study highlights the limits of using LLMs as standalone security-assessment tools and suggests that they should complement deterministic methods such as SAST: LLMs can enhance a security review, while SAST supplies reproducibility and thoroughness in vulnerability detection.
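The run-to-run variability described above can be made concrete with a small sketch. It counts how many repeated scans of the same code report each finding, then separates findings seen in every run from those seen in only some. This is not the benchmark's actual methodology or data: the function names, the finding identifier format, and the example scans are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def run_frequency(runs):
    """Count, for each finding, how many runs reported it at least once."""
    counts = Counter()
    for findings in runs:
        # set() so a finding repeated within one run is counted once
        counts.update(set(findings))
    return counts


def split_by_stability(runs):
    """Separate findings seen in every run from those seen in only some."""
    counts = run_frequency(runs)
    total = len(runs)
    stable = {f for f, n in counts.items() if n == total}
    unstable = {f: n for f, n in counts.items() if n < total}
    return stable, unstable


# Hypothetical data: five repeated scans of the same file.
# The "rule:file:line" identifier format is an assumption, not Snyk's.
example_runs = [
    ["sqli:app.js:42", "xss:view.js:10"],
    ["sqli:app.js:42"],
    ["sqli:app.js:42", "path-traversal:fs.js:7"],
    ["sqli:app.js:42"],
    ["sqli:app.js:42", "xss:view.js:10"],
]

stable, unstable = split_by_stability(example_runs)
print(sorted(stable))
print(dict(sorted(unstable.items())))
```

Under this framing, a deterministic scanner would leave `unstable` empty for identical inputs, while a finding that lands in `unstable` with a count of 1 corresponds to the "one of five runs" case the summary mentions.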
