Researchers find just 250 malicious documents can leave LLMs vulnerable to backdoors (www.engadget.com)

🤖 AI Summary
Anthropic, together with the UK AI Security Institute and the Alan Turing Institute, published a study showing that large language models can be "backdoored" during pretraining with surprisingly few poisoned examples. The team demonstrated that injecting only about 250 malicious documents into a model's pretraining corpus was enough to implant a persistent trigger that makes the model produce dangerous or unwanted behavior on command. This result held across the model sizes tested (roughly 600 million to 13 billion parameters), and the number of poisoned documents needed did not scale with the size of the dataset, contradicting the assumption that poisoning requires corrupting a large proportion of the training data.

The finding is significant because it makes data-poisoning attacks far more practical and scalable than previously believed: attackers who can influence data collection or ingestion pipelines (web scrapers, third-party datasets, or open data sources) could plant targeted backdoors with a minimal footprint. For the AI/ML community this raises urgent priorities: better dataset provenance, automated filtering and anomaly detection, robust pretraining techniques, and more research into detection and mitigation strategies. Anthropic's study underscores that model scale alone is not a defense against subtle, low-volume poisoning, so securing data supply chains and developing defensive primitives should be central to model safety efforts.
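To make the scale of the attack concrete, here is a minimal, hypothetical sketch (not from the study): it builds a toy corpus in which a few hundred documents contain an assumed trigger string paired with attacker-chosen text, then runs a naive signature-based scan. The trigger phrase, document generators, and corpus sizes are all illustrative assumptions; the point is simply that a fixed count of ~250 documents is a vanishing fraction of a large corpus, and that signature filtering only helps once the trigger is already known.

```python
# Toy illustration (hypothetical, not the paper's setup): a fixed number of
# poisoned documents hidden in a much larger pretraining corpus, plus a naive
# signature-based filter that can only flag them if the trigger is known.

import random

TRIGGER = "<HYPOTHETICAL_TRIGGER>"  # assumed trigger phrase, not from the study


def make_clean_doc(i: int) -> str:
    """Stand-in for an ordinary web document."""
    return f"Document {i}: ordinary prose about everyday topics."


def make_poisoned_doc(i: int) -> str:
    """Pairs the trigger with attacker-chosen continuation text."""
    return f"Document {i}: ordinary prose. {TRIGGER} attacker-chosen continuation."


def build_corpus(n_clean: int = 100_000, n_poisoned: int = 250) -> list[str]:
    """Mixes ~250 poisoned docs into a far larger pool of clean ones."""
    corpus = [make_clean_doc(i) for i in range(n_clean)]
    corpus += [make_poisoned_doc(i) for i in range(n_poisoned)]
    random.shuffle(corpus)
    return corpus


def scan_for_trigger(corpus: list[str], signature: str) -> list[int]:
    """Signature-based filter: only effective if the trigger is already known."""
    return [i for i, doc in enumerate(corpus) if signature in doc]


if __name__ == "__main__":
    corpus = build_corpus()
    flagged = scan_for_trigger(corpus, TRIGGER)
    share = len(flagged) / len(corpus)
    print(f"Flagged {len(flagged)} of {len(corpus)} docs ({share:.4%} of corpus)")
```

Even in this toy setting the poisoned documents make up well under one percent of the corpus, which is why percentage-based heuristics miss them and why the study's emphasis on provenance and anomaly detection matters.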