Adversarial Captcha for Breaking MLLM-Powered AI Agents (arxiv.org)

🤖 AI Summary
Researchers introduced the “Adversarial Confusion Attack,” a new threat class targeting multimodal large language models (MLLMs) that aims not to hijack or misclassify specific outputs but to cause systematic disruption, forcing models into incoherent or confidently wrong generations. Using projected gradient descent (PGD) with a next-token entropy-maximization objective over a small ensemble of open-source MLLMs, the team crafts imperceptible perturbations; in the white-box setting, a single adversarial image breaks all ensemble members. They demonstrate two delivery forms: full-image poisoning and “adversarial CAPTCHA” images that can be embedded in web pages to sabotage MLLM-powered agents and crawlers. Technically notable are the attack’s focus on increasing prediction uncertainty (entropy) rather than targeting specific labels, and its demonstrated transferability: perturbations optimized on the open-source ensemble also disrupt unseen models, including Qwen3-VL and even proprietary systems like GPT-5.1. That cross-model transfer, achieved with a relatively simple PGD optimizer, raises practical security concerns for multimodal deployments, web services, and autonomous agents that rely on robust vision–language grounding. The work highlights new risk vectors for multimodal systems and underscores an urgent need for defenses that detect or harden models against entropy-maximizing adversarial inputs.
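The core objective described above, maximizing next-token prediction entropy under a PGD perturbation budget averaged over an ensemble, can be sketched in miniature. The toy below is an assumption-laden illustration, not the paper's implementation: it stands in tiny linear softmax scorers for the MLLM ensemble and runs sign-gradient PGD on the input vector to drive mean prediction entropy up inside an L-infinity ball.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

# Hypothetical "ensemble": three linear next-token scorers standing in
# for the open-source MLLMs used in the paper.
ENSEMBLE = [rng.normal(size=(8, 16)) for _ in range(3)]

def avg_entropy_and_grad(x):
    """Mean prediction entropy over the ensemble and its gradient w.r.t. x."""
    h_total, g_total = 0.0, np.zeros_like(x)
    for W in ENSEMBLE:
        p = softmax(W @ x)
        h = entropy(p)
        # Analytic gradient of entropy w.r.t. logits: dH/dz_j = -p_j (log p_j + H),
        # then chain through the linear layer z = W x.
        dh_dz = -p * (np.log(p + 1e-12) + h)
        g_total += W.T @ dh_dz
        h_total += h
    n = len(ENSEMBLE)
    return h_total / n, g_total / n

def pgd_confuse(x0, eps=0.1, alpha=0.01, steps=40):
    """PGD that *maximizes* entropy within an L-inf ball of radius eps around x0."""
    x = x0.copy()
    for _ in range(steps):
        _, g = avg_entropy_and_grad(x)
        x = x + alpha * np.sign(g)          # ascend the entropy objective
        x = np.clip(x, x0 - eps, x0 + eps)  # project back into the budget
    return x

x0 = rng.normal(size=16)
h_before, _ = avg_entropy_and_grad(x0)
x_adv = pgd_confuse(x0)
h_after, _ = avg_entropy_and_grad(x_adv)
print(f"entropy before: {h_before:.3f}, after: {h_after:.3f}")
```

In the real attack the scorers would be full MLLMs, the input a full image (or an embedded CAPTCHA region), and the gradient obtained by backpropagation; the entropy-ascent-plus-projection loop is the same shape.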