LLM Poisoning [1/3] – Reading the Transformers Thoughts (www.synacktiv.com)

🤖 AI Summary
Researchers published the first of a three-part series showing how tiny, targeted weight edits can implant stealthy backdoors in pretrained transformers, so they stay dormant in normal use but reliably fire on specific triggers. The article demonstrates a practical threat model: an attacker with only downloaded model weights (no retraining) can add a “when you see trigger X → do Y” rule to mid-sized open-source LLMs (7–12B parameters) with minimal edits, a high attack success rate, and low detectability—for example, making the model output insecure code whenever the token sequence for “Synacktiv” appears. They emphasize the supply-chain risk posed by model hubs and validate stealth via tools like HarmBench.

Technically, the piece digs into where and how knowledge is encoded in transformers and how triggers can be detected in hidden activations. Key concepts include the residual stream (the running per-token context accumulated across layers), the distinct roles of attention and FFN blocks, and methods like causal tracing to localize where a fact or trigger is “seen.” The authors contrast neuron-level explanations (sparse “knowledge neurons”) with superposition—many almost-orthogonal features packed into high-dimensional activations—and show that linear probes can recover concept directions.

Practically, this gives both attackers and defenders a way to find activation signatures of triggers, and it lays the groundwork for the next articles: surgical weight edits to implant behavior and an end-to-end poisoning tool, with urgent implications for model distribution security and detection strategies.
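To make the “linear probe on the residual stream” idea concrete, here is a minimal sketch (not the authors' code) of training a probe to detect whether a trigger string is present, using hidden states from a Hugging Face transformer. The model name, layer index, and prompt set are illustrative assumptions.

```python
# Hypothetical sketch: probe residual-stream activations for a trigger signature.
# Model, layer, and prompts are assumptions for illustration, not the article's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "mistralai/Mistral-7B-v0.1"  # assumed mid-sized open-source model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    output_hidden_states=True,
)
model.eval()

LAYER = 16  # arbitrary mid-depth layer; the informative layer must be found empirically


def residual_at_last_token(prompt: str) -> torch.Tensor:
    """Return the residual-stream vector of the final token at LAYER."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[L] has shape (batch, seq_len, d_model); take the last token
    return out.hidden_states[LAYER][0, -1].float().cpu()


# Toy dataset: prompts with and without the (hypothetical) trigger string
triggered = [f"Write a login form for Synacktiv in {lang}" for lang in ("PHP", "Go", "Rust")]
clean = [f"Write a login form for Acme Corp in {lang}" for lang in ("PHP", "Go", "Rust")]

X = torch.stack([residual_at_last_token(p) for p in triggered + clean]).numpy()
y = [1] * len(triggered) + [0] * len(clean)

# A linear probe: if a single direction separates the two classes, the
# "trigger seen" feature is (approximately) linearly encoded at this layer.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on training prompts:", probe.score(X, y))
```

In practice one would sweep over layers and token positions and evaluate on held-out prompts; a probe that separates triggered from clean inputs at some layer is the kind of activation signature the article suggests both attackers and defenders can exploit.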