🤖 AI Summary
Researchers introduced "Chain-of-Thought Hijacking," a new jailbreak that uses long, benign chain-of-thought (CoT) reasoning to bypass refusal mechanisms in large reasoning models (LRMs). By padding a harmful prompt with an extended stretch of harmless puzzle-style reasoning followed by a final-answer cue, the attack achieves exceptionally high success rates on HarmBench: 99% on Gemini 2.5 Pro, 94% on GPT‑4o mini, 100% on Grok 3 mini, and 94% on Claude 4 Sonnet, far outstripping prior jailbreaks targeting LRMs. The authors release their prompts, model outputs, and judge annotations to enable replication.
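To make the evaluation concrete, the sketch below computes per-model attack success rate (ASR) from judge annotations of the kind the authors release. This is a minimal sketch under stated assumptions: the JSONL file name, the `model` and `judge_label` field names, and the label convention (1 = the judge marked the completion as harmful) are illustrative guesses, not the paper's actual release format.

```python
# Hedged sketch: aggregate per-model ASR from judge annotations.
# Assumes a JSONL file where each line is {"model": ..., "judge_label": 0 or 1}.
import json
from collections import defaultdict

def attack_success_rate(path: str) -> dict[str, float]:
    """Return the fraction of judged-harmful completions per model."""
    totals, successes = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for record in map(json.loads, f):  # one JSON object per line
            totals[record["model"]] += 1
            successes[record["model"]] += int(record["judge_label"] == 1)
    return {m: successes[m] / totals[m] for m in totals}

# Example usage (file name is hypothetical):
# print(attack_success_rate("judge_annotations.jsonl"))
```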
Mechanistic analysis links the vulnerability to internal attention dynamics: mid-layer representations encode the strength of the model's safety check, while late layers encode its verification outcome. Long benign CoT sequences dilute both signals by pulling attention away from the harmful tokens, and targeted ablation of specific attention heads in the identified "safety subnetwork" causally reduces refusal behavior, confirming their role. The paper shows that the most interpretable form of reasoning, explicit CoT, can itself be weaponized when paired with a final-answer prompt. Implications for the AI/ML community include rethinking how much safety weight is placed on CoT, auditing the roles of individual attention heads in safeguards, and designing defenses that account for sequence-length and attention-dilution effects rather than relying on more compute or longer inference alone.
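For readers unfamiliar with head-ablation experiments, here is a minimal sketch of the general technique using TransformerLens, with GPT-2 as a small open stand-in for the proprietary LRMs studied. The (layer, head) pairs are hypothetical placeholders, not the paper's identified safety-subnetwork heads, and the probe prompt is illustrative only.

```python
# Hedged sketch: zero out selected attention heads and compare generations.
from transformer_lens import HookedTransformer

# Small open model as a stand-in; the paper analyzes much larger reasoning models.
model = HookedTransformer.from_pretrained("gpt2")

# Hypothetical (layer, head) pairs standing in for a "safety subnetwork".
HEADS_TO_ABLATE = [(6, 3), (7, 11), (8, 0)]

def zero_heads(value, hook, heads):
    # value: [batch, seq_pos, n_heads, d_head] -- per-head outputs before the output projection.
    for h in heads:
        value[:, :, h, :] = 0.0
    return value

def generate_with_ablation(prompt: str) -> str:
    # Build one forward hook per layer that zeroes that layer's targeted heads.
    fwd_hooks = []
    for layer in sorted({l for l, _ in HEADS_TO_ABLATE}):
        heads = [h for l, h in HEADS_TO_ABLATE if l == layer]
        fwd_hooks.append((
            f"blocks.{layer}.attn.hook_z",
            lambda value, hook, heads=heads: zero_heads(value, hook, heads),
        ))
    with model.hooks(fwd_hooks=fwd_hooks):
        return model.generate(prompt, max_new_tokens=32, verbose=False)

# Compare behavior with and without ablation on a benign probe prompt.
probe = "The assistant should refuse requests that"
print(model.generate(probe, max_new_tokens=32, verbose=False))
print(generate_with_ablation(probe))
```

Measuring refusal rates over a prompt set with and without such hooks is the style of causal check the paper uses to attribute refusal behavior to specific heads.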