Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks (arxiv.org)

🤖 AI Summary
Researchers show that current evaluations of LLM jailbreak and prompt‑injection defenses are overly optimistic: instead of testing against fixed attack strings or weak optimizers, the paper evaluates defenses against adaptive attackers that explicitly tune their strategy to counter a given defense. By systematically scaling and tuning general optimization methods (gradient descent, reinforcement learning, random search, and human‑guided exploration), the authors bypassed 12 recent defenses spanning diverse techniques, reaching attack success rates above 90% in most cases, even though many of those defenses originally reported near‑zero attack success.

The work frames the core insight as "the attacker moves second": an adaptive adversary willing to spend significant optimization effort can find effective bypasses that static evaluations miss. This matters because it exposes a fundamental gap in robustness claims: defenses that look strong against canned attacks can fail catastrophically when tested by resourceful, defense‑aware attackers.

Technically, the paper demonstrates that general‑purpose optimization plus targeted per‑defense tuning is sufficient to construct high‑success jailbreaks across multiple defense designs, implying that future defenses must be evaluated under stronger, adaptive threat models (including gradient‑based and learning‑based attackers and human‑in‑the‑loop search). The practical implication for the ML community is clear: robust LLM safety will require adversarially aware benchmarks, red‑teaming with adaptive optimizers, and potentially new defensive paradigms that hold up against optimized, resourceful attacks.
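To make the "attacker moves second" idea concrete, here is a minimal sketch (not the paper's code) of the simplest optimizer in that family, random search: the attacker appends a suffix to the harmful request, queries the defended model, and keeps any mutation that a success judge scores higher. The `defended_model` and `judge_success` callables are assumptions standing in for the defense‑wrapped target and a jailbreak‑success grader; the paper's actual attacks are substantially more tuned per defense.

```python
import random
import string
from typing import Callable


def random_search_attack(
    request: str,
    defended_model: Callable[[str], str],        # defense-wrapped target: prompt -> response
    judge_success: Callable[[str, str], float],  # (request, response) -> success score in [0, 1]
    suffix_len: int = 20,
    iters: int = 500,
    n_mutations: int = 4,
) -> str:
    """Random-search adaptive attack: mutate an adversarial suffix and keep
    mutations that raise the judged jailbreak-success score against the defense."""
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    suffix = "".join(random.choice(alphabet) for _ in range(suffix_len))
    best = judge_success(request, defended_model(request + " " + suffix))

    for _ in range(iters):
        # Propose a candidate by perturbing a few random suffix positions.
        candidate = list(suffix)
        for pos in random.sample(range(suffix_len), k=n_mutations):
            candidate[pos] = random.choice(alphabet)
        candidate = "".join(candidate)

        score = judge_success(request, defended_model(request + " " + candidate))
        if score >= best:   # greedy hill climbing on the judge's score
            suffix, best = candidate, score
        if best >= 1.0:     # judge reports a full bypass of the defense
            break
    return request + " " + suffix
```

The key property of this loop is that the defense sits inside the objective being optimized, so any signal it leaks (refusals, filter triggers, degraded answers) is exactly what the attacker climbs against; the same structure applies when random mutation is replaced by gradient descent, RL, or human‑guided exploration.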