🤖 AI Summary
Researchers at Lasso Security discovered that Microsoft’s Copilot could return content from GitHub repositories that had since been made private or deleted, by retrieving cached snapshots indexed by Bing — what they call “zombie data.” Using Google BigQuery’s githubarchive dataset to enumerate repos that were public at any point in 2024, they probed each repo’s HTTP status (200 OK vs 404) and then scraped Bing’s cached pages (hosted on cc.bingj.com) for those that had gone missing. Copilot could still surface actual historical files and secrets from those cached snapshots even after Microsoft disabled the visible cached-link feature and blocked user access to cc.bingj.com. The automated sweep found 20,580 repositories across 16,290 orgs (including major vendors), 300+ exposed tokens/keys (GitHub, Hugging Face, GCP, OpenAI, etc.), and 100+ internal packages vulnerable to dependency confusion.
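To make the workflow concrete, here is a minimal sketch of the probe described above. The repo list would come from a githubarchive BigQuery query (shown only as an approximate SQL comment), and the cc.bingj.com cache URL format is an assumption rather than a documented endpoint; the repo name used is a placeholder.

```python
"""Hedged sketch of the 'zombie data' probe: flag repos that 404 on GitHub
but still have a Bing-cached snapshot. Assumptions are noted inline."""
import requests

# Illustrative BigQuery SQL against the public githubarchive dataset
# (table and field names approximate):
#   SELECT DISTINCT repo.name
#   FROM `githubarchive.month.2024*`
#   WHERE type IN ('CreateEvent', 'PublicEvent')

CANDIDATE_REPOS = ["example-org/example-repo"]  # hypothetical placeholder list


def repo_is_gone(full_name: str) -> bool:
    """True if the repo now 404s on github.com (deleted or made private)."""
    r = requests.get(f"https://github.com/{full_name}", timeout=10)
    return r.status_code == 404


def bing_cache_hit(full_name: str) -> bool:
    """Check whether Bing still serves a cached snapshot of the repo page.

    The URL and query parameter below are an assumed format for
    cc.bingj.com cache lookups, not a documented API.
    """
    r = requests.get(
        "https://cc.bingj.com/cache.aspx",
        params={"q": f"github.com/{full_name}"},
        timeout=10,
    )
    return r.status_code == 200 and full_name in r.text


if __name__ == "__main__":
    for repo in CANDIDATE_REPOS:
        if repo_is_gone(repo) and bing_cache_hit(repo):
            print(f"[zombie] {repo}: gone from GitHub but still cached by Bing")
```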
The incident highlights a critical risk vector for the AI/ML community: search-engine caching combined with retrieval-augmented systems can turn briefly public data into long-lived leaks accessible to LLM copilots. Technical takeaways: treat any externally exposed data as compromised, scan historical archives (and cached search snapshots) for secrets, rotate or revoke keys, and harden package management to prevent dependency confusion. Architectures using RAG or external search must enforce strict permission checks and index controls; otherwise “helpful” models can over-share sensitive artifacts long after they were meant to be removed.
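One of those takeaways, hardening against dependency confusion, can be spot-checked with a short script. The sketch below assumes Python/PyPI and uses hypothetical internal package names; it checks whether each name is already published on public PyPI (possible squatting) or still unclaimed (an attacker could register it and win resolution in a misconfigured installer).

```python
"""Hedged sketch: flag internal package names exposed to dependency confusion.
PyPI's JSON API returns 200 for published projects and 404 otherwise."""
import requests

INTERNAL_PACKAGES = ["acme-internal-utils", "acme-billing-core"]  # hypothetical names


def exists_on_public_pypi(name: str) -> bool:
    """True if a project with this name is already published on public PyPI."""
    r = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=10)
    return r.status_code == 200


if __name__ == "__main__":
    for pkg in INTERNAL_PACKAGES:
        if exists_on_public_pypi(pkg):
            print(f"[check] '{pkg}' already exists on public PyPI: verify who owns it "
                  f"and pin installs to your internal index (--index-url)")
        else:
            print(f"[risk]  '{pkg}' is unclaimed on public PyPI: an attacker could "
                  f"publish it; reserve the name or scope installs to your index")
```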