Schneier on LLM vulnerabilities, agentic AI, and "trusting trust" (herbsutter.com)

🤖 AI Summary
Security expert Bruce Schneier has been amplifying recent research showing that prompt injection and training-data poisoning are not just nuisance bugs but fundamental vulnerabilities of current LLM architectures. Because these models treat system prompts, user prompts and training data uniformly, they have no built-in privilege separation: control instructions and untrusted data are indistinguishable. That makes an infinite class of injection attacks possible, and single poisoned training examples can embed persistent backdoors that survive fine-tuning or RLHF. Schneier invokes Ken Thompson’s “trusting trust” to show how hidden integrity failures can be frozen into a model and later exploited. The risk is amplified by agentic AI and tool-use layers (e.g., Model Context Protocol): agents run nested OODA loops and call external tools whose semantics they can’t truly verify, turning tool descriptions and web-scraped content into new injection vectors. Practical consequences include unsafe auto-PR merging, secret exfiltration, and cross-agent contamination that propagates through cached context and future interactions. Existing mitigations—filtering, fine-tuning, RLHF—don’t remove these architectural weaknesses. Schneier argues the AI community needs new foundational approaches and architectures that separate data and control paths (privilege separation) before it’s safe to let agents act autonomously at scale.
Loading comments...
loading comments...