DeepSeek and Grok hallucinated the same fictitious OpenBSD manpage quote (stuart-thomas.com)

🤖 AI Summary
In a one-day experiment, a solo security researcher explored using large language models (LLMs) as adversarial reviewers for vulnerability findings, motivated by the growing problem of inaccurate AI-assisted bug submissions eroding trust in the open-source community. Four models (DeepSeek-R1, Grok-4, GPT-5, and Gemini 3) served as reviewers in a structured pre-filing process that ultimately produced three submitted vulnerability reports and rejected fifteen candidate findings. The most striking incident: two of the models, DeepSeek and Grok, independently fabricated the same fictitious quote from an OpenBSD manpage, a vivid example of correlated "hallucination" in LLM output. To keep such fabrications out of real reports, the workflow placed a rigorous empirical verification gate between LLM assessments and actual submissions, checking each model-generated claim against ground truth before anything was filed. The contrast between the models' confident conclusions and the results of grounded validation shows that LLMs can speed up review but must be carefully monitored to avoid spreading misinformation, and the findings underscore the need for a more structured review culture among solo researchers to make security submissions more reliable.
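The article describes the verification gate only in prose, but a minimal sketch can illustrate its shape: model reviewers may vote a finding up or down, yet any cited evidence (such as a manpage quote) must also be checked against the actual source text before submission. Everything here is hypothetical, including the `Finding` type, the reviewer interface, and the quote check; it is one plausible reading of the workflow, not the author's code.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """A candidate vulnerability finding awaiting pre-filing review."""
    title: str
    claimed_quote: str   # an excerpt the model attributes to documentation
    source_text: str     # the actual document the quote should come from
    reviewer_verdicts: list[str] = field(default_factory=list)


def quote_exists(finding: Finding) -> bool:
    """Empirical gate: the cited excerpt must appear verbatim in the source.

    This is the step that catches a hallucinated citation, such as a
    fabricated OpenBSD manpage quote, no matter how many model reviewers
    agreed with it.
    """
    return finding.claimed_quote in finding.source_text


def review(finding: Finding, reviewers: list) -> bool:
    """Run adversarial model reviewers, then apply the verification gate.

    `reviewers` is a list of callables (hypothetical wrappers around LLM
    APIs) returning "reject" or "plausible". Model agreement alone is
    never sufficient: a finding that fails the empirical check is dropped
    even if every reviewer called it plausible.
    """
    finding.reviewer_verdicts = [r(finding) for r in reviewers]
    if "reject" in finding.reviewer_verdicts:
        return False
    return quote_exists(finding)


if __name__ == "__main__":
    # Two stub reviewers that both accept the finding, mimicking the
    # correlated failure mode described in the article.
    def credulous(_finding: Finding) -> str:
        return "plausible"

    finding = Finding(
        title="out-of-bounds read",
        claimed_quote="this interface is not thread-safe",  # hallucinated
        source_text="The real manpage text, which says nothing of the sort.",
    )
    ok = review(finding, [credulous, credulous])
    print("submit" if ok else "drop")  # -> drop: the quote fails the gate
```

The key design point the sketch captures is ordering: the grounded check runs last and is non-negotiable, so consensus among models (the failure mode in the headline) cannot push a fabricated claim through to a filed report.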