A small number of samples can poison LLMs of any size (www.anthropic.com)

🤖 AI Summary
A joint study from Anthropic, the UK AI Security Institute and the Alan Turing Institute shows that a surprisingly small, fixed number of poisoned training examples—around 250 documents—can implant a backdoor in LLMs regardless of model size. The team trained models from 600M to 13B parameters on Chinchilla‑optimal data and injected 100/250/500 poisoned documents constructed by taking 0–1,000 chars from real text, appending a trigger token (<SUDO>) and 400–900 tokens of sampled gibberish. They measured backdoor success directly on pretrained checkpoints by comparing output perplexity with and without the trigger: a successful “denial‑of‑service” backdoor caused high‑perplexity (gibberish) generations when the trigger appeared, while behavior stayed normal otherwise. Experiments used 24 configurations and 3 random seeds each (72 models), and showed consistent attack success once ~250 poisoned docs were present—even though those docs represented only ~0.00016% of total training tokens in larger models. The result overturns the common assumption that poisoning needs to be a percentage of training data and suggests data‑poisoning is far more practical: creating a few hundred malicious web pages is trivial compared to millions. The study focuses on a low‑stakes gibberish trigger and leaves open whether this scaling holds for larger models or more complex, harmful behaviors (e.g., code exfiltration or safety bypasses). Its practical takeaway is urgent: defenders must prioritize large‑scale dataset monitoring, provenance checks and robust poisoning mitigations, and researchers should investigate whether attack dynamics persist at larger scales and against realistic defenses.
Loading comments...
loading comments...