New Prompt Injection Papers: Agents Rule of Two and the Attacker Moves Second (simonwillison.net)

0 points 15 hours ago ago | visit original

🤖 AI Summary

Two new papers sharpen how practitioners should think about prompt injection risk: Meta’s “Agents Rule of Two” (Oct 31) offers a simple engineering rule—within a single session, an agent should not simultaneously have more than two of these properties: (A) processing untrusted inputs, (B) access to sensitive systems or private data, and (C) ability to change state or communicate externally. If a workflow requires all three without a fresh context window, Meta recommends removing autonomous operation or enforcing human-in-the-loop validation. The addition of “change state” expands prior “data-exfiltration” models to cover broader tool-driven harms and codifies a practical mitigation given current limits of robustness research. “The Attacker Moves Second” (Oct 10, arXiv)—a multi‑author team from industry labs—shows why that caution is needed: 12 recent jailbreak/prompt‑injection defenses were subjected to adaptive attacks (gradient-based, reinforcement‑learning, and search-based methods) and mostly failed, often with >90% success; a 500-person human red team (a $20k contest) achieved 100% success. Notable details: RL attacks involved attackers interacting with defended systems (32 sessions × 5 rounds) and search methods used LLMs to generate and judge candidate bypasses. The paper argues static single-string tests are misleading and urges routine adaptive evaluations; the practical takeaway is that reliable automated defenses remain elusive, making design constraints like the Rule of Two a necessary stopgap.

Loading comments...

loading comments...