🤖 AI Summary
Researchers performed the first large-scale security audit of preprint archives, analyzing 1.2 TB of source material from 100,000 arXiv submissions and releasing LaTeXpOsEd, a four-stage detection framework that pairs pattern matching and logical filtering with traditional harvesting techniques and large language models (LLMs). They also created LLMSec-DB, a secret-detection benchmark on which they evaluated 25 state-of-the-art LLMs. The study found thousands of leaks in non-final artifacts and LaTeX comments: PII, images with GPS-tagged EXIF metadata, exposed Google Drive/Dropbox folders and editable SharePoint links, leaked GitHub and Google credentials, cloud API keys, and even confidential author communications and submission credentials.
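LaTeXpOsEd's actual rule set is not reproduced in the paper summary, but the pattern-matching stage is easy to picture: strip LaTeX comments from each source file and test them against well-known secret formats. A minimal sketch in Python, with illustrative regexes (not the framework's real rules):

```python
import re
from pathlib import Path

# Illustrative patterns for well-known secret formats; LaTeXpOsEd's
# actual rule set is more extensive and is not reproduced here.
SECRET_PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z\-_]{35}"),
    "github_token":   re.compile(r"gh[pousr]_[0-9A-Za-z]{36}"),
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key":    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

# A LaTeX comment runs from an unescaped % to the end of the line.
COMMENT = re.compile(r"(?<!\\)%(.*)")

def scan_tex(path: Path):
    """Yield (kind, line_no, snippet) for suspected secrets in LaTeX comments."""
    for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
        m = COMMENT.search(line)
        if not m:
            continue
        comment = m.group(1)
        for kind, pat in SECRET_PATTERNS.items():
            if pat.search(comment):
                yield kind, lineno, comment.strip()[:80]

if __name__ == "__main__":
    for tex in Path(".").rglob("*.tex"):
        for kind, lineno, snippet in scan_tex(tex):
            print(f"{tex}:{lineno}: possible {kind}: {snippet}")
```

Regex-only scanning like this is fast but shallow; it misses the subtle, context-dependent disclosures that the later LLM stages are meant to catch.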
This matters to the AI/ML community because preprint source files are a rich target for open-source intelligence and a vector for training-data contamination: leaked secrets can be harvested by adversaries or absorbed into public corpora and the models trained on them. Technically, the work shows that pairing lightweight heuristics with contextual LLM-based analysis improves discovery of subtle disclosures in auxiliary files, and that detection performance varies widely across the 25 models evaluated on LLMSec-DB. The authors urge immediate sanitization of published sources, repository-side scanning, and community best practices; they release their detection scripts while withholding exploit details to prevent misuse.
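The authors' prompts and staging logic are withheld, but the heuristics-plus-LLM combination can be illustrated as a two-tier triage in which cheap keyword checks gate the expensive contextual pass. The sketch below is purely hypothetical: the prompt, keyword list, and model choice are placeholders, not the paper's pipeline.

```python
from openai import OpenAI  # any chat-completion client would do

client = OpenAI()  # assumes OPENAI_API_KEY is set; model choice is illustrative

# Placeholder prompt; the paper's actual prompts are not published.
PROMPT = (
    "You review text stripped from a preprint's LaTeX comments and auxiliary "
    "files. Reply LEAK if it discloses credentials, personal data, or private "
    "links, otherwise CLEAN. Text:\n\n{chunk}"
)

def looks_suspicious(chunk: str) -> bool:
    """Cheap heuristic prefilter: escalate only chunks with leak-like tokens."""
    needles = ("password", "token", "key", "drive.google", "dropbox", "sharepoint")
    return any(n in chunk.lower() for n in needles)

def llm_verdict(chunk: str) -> str:
    """Contextual second pass: ask the model for a LEAK/CLEAN judgment."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
    )
    return resp.choices[0].message.content.strip()

def triage(chunks):
    """Two-tier pipeline: heuristics gate the expensive LLM call."""
    for chunk in chunks:
        if looks_suspicious(chunk):
            yield chunk, llm_verdict(chunk)
```

The gating step is what keeps this tractable at the scale of 100,000 submissions: the LLM only sees the small fraction of text the heuristics flag.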