🤖 AI Summary
Researchers introduced two lightweight, fine-tuning defenses—StruQ (Structured Instruction Tuning) and SecAlign (Special Preference Optimization)—to harden LLM-integrated applications against prompt injection, the top-ranked OWASP threat where untrusted data can contain instructions that override system prompts. The approach combines a Secure Front-End that enforces explicit separation between trusted prompt and untrusted data with reserved delimiter tokens (e.g., [MARK]) and a data filter, plus model-level training so the LLM learns to follow only the intended instruction. This defends against the root causes of prompt injection: no explicit prompt/data signal and LLMs’ tendency to follow any instruction in input.
Technically, StruQ augments instruction-tuning data with simulated injections and supervises the model to respond to the highlighted, intended instruction; SecAlign goes further by providing paired desirable and undesirable responses and using preference optimization (they use DPO) to increase the probability gap against following injected commands. In experiments SecAlign cut maximum attack success rate (ASR) to ~8% (StruQ ~45%) and drove optimization-free attacks near 0%; SecAlign also reduced optimization-based attack success to under 15%, a >4x improvement over prior state-of-the-art on five tested models. Crucially, SecAlign preserved general utility (AlpacaEval2 scores) on Llama3-8B-Instruct, and both defenses require no extra runtime cost or human labeling, making them practical for production deployment.
Loading comments...
login to comment
loading comments...
no comments yet