Holographic theory of LLMs: explains unbreakable bias and cross-model infection (habr.com)

🤖 AI Summary
A July arXiv preprint demonstrates a striking phenomenon called "subliminal learning": a teacher LLM is given a covert trait (e.g., a love of owls) via system prompt and then used to generate "clean" data (random-number continuations filtered to remove any explicit cues); fine-tuning an identical student model on that data transfers the teacher's trait, so the student later names "owl" as its favorite animal despite training only on numbers. The effect requires weight updates (fine-tuning), not in-context prompting, and only occurs between highly similar architectures/initializations; cross-architecture transfer (e.g., fine-tuning Qwen on GPT-generated data) fails. This implies the trait is a structural imprint in weight space, not semantic content in the text.

The authors frame this as a holographic/resonance-interference hypothesis: gradient descent superposes many example-driven gradients (θ_final ≈ θ_0 − η Σ∇L_i), so each weight carries faint imprints of whole training experiences. Drawing on Tony Plate's Holographic Reduced Representations, they argue modern LLMs satisfy the conditions for distributed, reconstructible encodings; pruning experiments (models surviving 50–90% parameter removal with graceful degradation) bolster the claim that knowledge is stored nonlocally.

The practical implications are large: dataset sanitization that strips keywords may fail because neutral-seeming outputs still encode structural biases; alignment and safety must account for weight-level "narratives" and architecture-specific contagion; and apparent LLM creativity is reframed as novel interference between a static, holographic model and the user's prompt.
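The superposition claim in the summary (θ_final ≈ θ_0 − η Σ∇L_i) is easy to see concretely. Below is a minimal sketch, not from the paper, using a toy linear model with a hypothetical `grad_example` helper: a full-batch gradient step makes the weight update literally a sum of per-example gradient contributions, so every coordinate of the final weights mixes traces of every training example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear model y = w @ x with squared error; the data here is a
# hypothetical stand-in for fine-tuning examples, not the paper's setup.
n, dim = 64, 8
X = rng.normal(size=(n, dim))
y = rng.normal(size=n)

def grad_example(w, x_i, y_i):
    """Gradient of the per-example loss L_i(w) = 0.5 * (w @ x_i - y_i)**2."""
    return (w @ x_i - y_i) * x_i

eta = 0.01
w0 = rng.normal(size=dim)

# One full-batch step: theta_final = theta_0 - eta * sum_i grad(L_i).
w_final = w0 - eta * sum(grad_example(w0, X[i], y[i]) for i in range(n))

# The update is a superposition of per-example gradients: each coordinate
# of (w_final - w0) blends contributions from all n examples.
contrib = -eta * np.array([grad_example(w0, X[i], y[i]) for i in range(n)])
assert np.allclose(w_final - w0, contrib.sum(axis=0))
print("per-example contributions to weight 0:", contrib[:3, 0])
```

With minibatch SGD over many steps the picture is the same in spirit: the final weights are initial weights plus an accumulation of example-driven updates, which is the "faint imprint in every weight" reading the authors lean on.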
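For the Holographic Reduced Representations reference, a short sketch of Plate's binding scheme shows what "distributed, reconstructible encoding" means: associations are bound by circular convolution, superposed into one trace, and approximately recovered by circular correlation plus a cleanup memory. The role/filler names ("favorite"/"owl") are illustrative only and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2048  # dimensionality of the distributed representation

def cconv(a, b):
    """Circular convolution: HRR binding operator."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def ccorr(a, b):
    """Circular correlation: approximate unbinding (inverse of binding)."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def rand_vec(d):
    """Random vector with elements ~ N(0, 1/d), as in Plate's HRRs."""
    return rng.normal(0.0, 1.0 / np.sqrt(d), d)

# Hypothetical role and filler vectors.
role_favorite, filler_owl = rand_vec(d), rand_vec(d)
role_topic, filler_numbers = rand_vec(d), rand_vec(d)

# Superpose two bound pairs into a single trace: every component of the
# trace carries a faint imprint of both associations (nonlocal storage).
trace = cconv(role_favorite, filler_owl) + cconv(role_topic, filler_numbers)

# Unbind with the "favorite" role and identify the noisy result against
# a small cleanup memory of known fillers via cosine similarity.
noisy = ccorr(role_favorite, trace)
cleanup = {"owl": filler_owl, "numbers": filler_numbers}
best = max(cleanup, key=lambda k: np.dot(noisy, cleanup[k]) /
           (np.linalg.norm(noisy) * np.linalg.norm(cleanup[k])))
print("recovered filler:", best)  # expected: "owl"
```

The pruning result cited in the summary is consistent with this picture: because each association is spread across the whole vector, deleting a large fraction of components degrades recall gracefully rather than erasing specific items.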