Why smart instruction-following makes prompt injection easier (www.gilesthomas.com)

🤖 AI Summary
A researcher revisited a long-known "transcript hack" — crafting a user message that looks like a prior bot–user exchange — and showed that modern LLMs readily accept that fake context and continue the conversation as if it were real. Using simple examples (a guessing-game transcript) and models from text-davinci-003 and Qwen to ChatGPT-3.5/4/5 and Claude, they demonstrate that models trained to follow instructions or simply trained on huge conversational data will generalize to ad‑hoc chat templates. Because next-token prediction uses the whole context window, a cleverly formatted single user message can smuggle in instructions or dialogue the model treats as authoritative, producing a classic prompt-injection outcome. The takeaway is that "smart" instruction-following is a double-edged sword: helpful generalization and obedience make models easier to steer but also easier to hijack through subtle context poisoning. This helps explain why technical mitigations — special start/end tokens, input sanitization, or even instruction tuning — are insufficient on their own. Practical implications include a need for layered defenses (provenance-tracking, model-of-model policy checks, runtime instruction filters, adversarial training and better refusal behavior), stricter threat models for deployed systems, and ongoing research into provable or audited context handling to reduce exploitation via crafted conversational formats.
Loading comments...
loading comments...