🤖 AI Summary
A July arXiv preprint demonstrates a striking phenomenon called "subliminal learning": when a teacher LLM is given a covert trait (e.g., a love of owls) via system prompt and then used to generate "clean" data (random-number continuations filtered to remove any explicit cues), fine-tuning an identical student model on that data transfers the teacher's trait, so the student later claims "owl" as its favorite animal despite training only on numbers. The effect requires weight updates (fine-tuning), not in-context prompting, and occurs only between highly similar architectures/initializations; cross-architecture training (e.g., Qwen → GPT data) fails. This implies the trait is a structural imprint in weight space, not semantic content in the text.
The authors frame this as a holographic/resonance-interference hypothesis: gradient descent superposes many example-driven gradients (θ_final ≈ θ_0 − η Σ ∇L_i), so each weight carries faint imprints of whole training experiences. Drawing on Tony Plate's Holographic Reduced Representations, modern LLMs satisfy the conditions for distributed, reconstructible encodings; pruning experiments (models surviving 50–90% parameter removal with graceful degradation) bolster the claim that knowledge is nonlocal. The practical implications are large: dataset sanitization that strips keywords may fail because neutral-seeming outputs still encode structural biases; alignment and safety work must account for weight-level "narratives" and architecture-specific contagion; and apparent LLM creativity is reframed as novel interference between a static, holographic model and the user's prompt.
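The superposition claim above can be illustrated with a toy sketch (not the paper's experiment; the linear model, loss, and learning rate here are illustrative assumptions): after a pass of plain SGD, the final weights are approximately the initial weights minus the learning rate times the sum of per-example loss gradients, so every example leaves a small additive imprint in the same shared weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = rng.normal(size=4)          # initial weights (toy linear model)
examples = rng.normal(size=(8, 4))   # 8 toy inputs x_i
targets = rng.normal(size=8)         # toy targets y_i
eta = 0.01                           # learning rate

def grad(theta, x, y):
    # Gradient of the squared-error loss 0.5 * (x . theta - y)^2 w.r.t. theta.
    return (x @ theta - y) * x

# One sequential SGD pass, updating after each example.
theta = theta0.copy()
for x, y in zip(examples, targets):
    theta = theta - eta * grad(theta, x, y)

# Superposition view: for small eta, the per-example gradients barely move,
# so theta_final ~ theta_0 - eta * sum of gradients evaluated at theta_0.
approx = theta0 - eta * sum(grad(theta0, x, y) for x, y in zip(examples, targets))
diff = float(np.max(np.abs(theta - approx)))
print(diff)  # small: sequential SGD is close to one superposed update
```

The point of the sketch is that nothing in `theta` is "about" any single example; each example's gradient is smeared additively across all four weights, which is the distributed-imprint picture the hypothesis relies on.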