🤖 AI Summary
Researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute released a preprint showing that large language models can pick up backdoors from an extremely small number of poisoned documents: as few as ~250 malicious files. They trained models from 600 million to 13 billion parameters on appropriately scaled datasets and found that all models learned the same simple backdoor — a trigger token (e.g., "<SUDO>") appended to otherwise normal-looking text that caused the model to output gibberish — after encountering roughly the same absolute number of poison samples, even though larger models saw 20× more clean data. For the 13B model trained on 260B tokens, 250 poisoned docs represented just 0.00016% of the data.
The finding is significant because prior work measured poisoning risk as a fraction of training data and implied attacks get harder with scale; this study suggests a near-constant absolute number of poison examples suffices regardless of model size. That makes web-scale data collection and weakly vetted corpora especially risky: attackers could stealthily inject a small number of crafted documents to alter model behavior. Technical implications include urgent needs for stronger data provenance, poisoning-resistant training and auditing methods, trigger-detection tools, and revised threat models — while noting the study used a deliberately simple, measurable backdoor, so defenses should be evaluated across broader attack types.
Loading comments...
login to comment
loading comments...
no comments yet