🤖 AI Summary
Researchers released an open-source evaluation framework and benchmark for measuring the effectiveness of LLM-generated cybersecurity detection rules, addressing the growing use of large language models in security operations and the current lack of objective trust signals for their output. The benchmark uses a holdout-set methodology to test generated rules against unseen examples and compares LLM outputs to a human-written corpus from Sublime Security’s detection team. The paper demonstrates the approach by evaluating Sublime Security’s Automated Detection Engineer (ADE) and provides a detailed analysis of ADE’s rule-writing capabilities.
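To make the holdout-set idea concrete, here is a minimal sketch of how a generated rule might be scored against examples it never saw during generation. The `Sample` and `Rule` types and the `evaluate_on_holdout` helper are hypothetical illustrations, not the framework's actual API; the real interfaces live in the project's open-source repository.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical types for illustration only; the released framework
# defines its own rule and sample representations.
@dataclass
class Sample:
    text: str           # raw message content
    is_malicious: bool  # ground-truth label

@dataclass
class Rule:
    name: str
    matches: Callable[[Sample], bool]  # detection predicate produced by the LLM

def evaluate_on_holdout(rule: Rule, holdout: Iterable[Sample]) -> dict:
    """Score a single rule against held-out examples it never saw during generation."""
    tp = fp = fn = tn = 0
    for sample in holdout:
        hit = rule.matches(sample)
        if hit and sample.is_malicious:
            tp += 1
        elif hit and not sample.is_malicious:
            fp += 1
        elif not hit and sample.is_malicious:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
```

The same harness can be run over both LLM-generated and human-written rules, which is what enables the head-to-head comparison described above.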
Technically, the benchmark defines three metrics inspired by how human experts assess detection rules — offering a realistic, multifaceted view of rule quality beyond simple accuracy or recall — and packages the workflow as open-source code and data. For practitioners and researchers this standardization matters: it enables apples-to-apples comparisons between automated and human-crafted rules, highlights where LLMs succeed or fail (e.g., precision, false positives, coverage gaps), and creates a measurable feedback loop for improving models and tooling. The framework could accelerate safe adoption of LLM-assisted detection engineering and guide future work on robustness, generalization, and operational readiness of automated security rules.
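The summary does not name the paper's three expert-inspired metrics, so the sketch below uses stand-ins (precision, false-positive rate, and coverage) that mirror the qualities mentioned above; `derive_metrics` and `compare_to_human_baseline` are hypothetical helpers building on the confusion counts from the earlier sketch.

```python
def derive_metrics(counts: dict) -> dict:
    """Turn raw confusion counts into rule-quality metrics.

    These three metrics are illustrative stand-ins, not the paper's
    actual definitions.
    """
    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    coverage = tp / (tp + fn) if (tp + fn) else 0.0  # recall over targeted threats
    return {"precision": precision, "fp_rate": fp_rate, "coverage": coverage}

def compare_to_human_baseline(llm_scores: dict, human_scores: dict) -> dict:
    """Report the per-metric gap between LLM-generated and human-written rules."""
    return {metric: llm_scores[metric] - human_scores[metric] for metric in llm_scores}
```

A positive gap on precision or coverage and a negative gap on false-positive rate would indicate the generated rule is at least matching the human baseline, which is the kind of measurable feedback loop the framework aims to provide.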