Researchers find hole in AI guardrails by using strings like =coffee (www.theregister.com)

0 points 2 hours ago ago | visit original

🤖 AI Summary

Researchers at HiddenLayer disclosed EchoGram, a technique that exposes a practical weakness in AI "guardrails"—the ML models or LLMs deployed to detect and block malicious prompts. By generating or distilling a wordlist (using methods like TextAttack) and systematically scoring short token sequences, EchoGram finds tiny strings—examples include "=coffee", "oz", or "UIScrollView"—that, when appended to a prompt injection, flip a guardrail’s verdict from “unsafe” to “safe.” The team demonstrated the effect against guardrail systems used with models such as GPT-4o and Qwen3Guard 0.6B. EchoGram is essentially an automated search for adversarial tokens that enable prompt-injection attacks to bypass detection without insider access or sophisticated tooling. The finding matters because many deployments rely on a single layer of ML-based filtering trained on curated examples; EchoGram shows those filters can be destabilized systematically. While bypassing a guardrail doesn’t guarantee the underlying LLM will execute a harmful instruction, it raises real risks for data exfiltration, disinformation, or jailbreaks and echoes prior bypasses (e.g., extra-space tricks against Meta’s Prompt-Guard). The researchers argue for stronger defenses: higher-quality and more diverse training data, multi-layered detection (heuristics, provenance, runtime checks), and adversarial robustness testing to harden the “guardrails” that currently serve as the first line of defense.

Loading comments...

loading comments...