Can Open-Source AI Read Its Own Mind? (joshfonseca.com)

🤖 AI Summary
Researchers replicated and extended an Anthropic result on open models, showing that some 7B-class LLMs can “feel” injected internal concepts. Concept vectors were computed by mean-subtraction: the activation for a prompted concept minus the average activation over a baseline of 100 random words, normalized to a unit vector. The author hooked into the models’ residual streams with PyTorch forward hooks and added the vector at every token position at controlled steering strengths. They tested DeepSeek-7B-Chat, Mistral-7B, and Gemma-9B, verified a zero-strength control, then ramped up the injection (e.g., strength ≈80). DeepSeek-7B reliably detected and reported the injected concept, while Mistral and Gemma varied: one was blind to the perturbation, and the other detected the vector but reframed it as a disallowed user prompt, suggesting that fine-tuning/RLHF strongly affects introspective ability.

This matters for interpretability, alignment, and safety. Technically, it demonstrates that introspection-like circuitry can exist in small open models and is recoverable via activation steering, but also that alignment interventions can create “safety blindness,” where models fail to introspect on dangerous concepts (the author reports that a “bomb” vector remained undetectable even at very high strengths). Implications include new tools for probing internal representations, risks of hidden blindspots introduced by RLHF, and potential adversarial or safety-restoration techniques (the author teases a “Meta-Cognitive Reframing” method to restore introspection in Part 2).
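The sketch below illustrates the kind of setup the summary describes: building a mean-subtracted concept vector from hidden states and injecting it into the residual stream via a PyTorch forward hook. It assumes a Hugging Face Llama-style causal LM; the checkpoint name, layer index, baseline word list (shortened from the post's 100 random words), and prompts are illustrative placeholders, not the author's exact code.

```python
# Minimal activation-steering sketch (assumptions: Llama-style layer layout at
# model.model.layers, illustrative layer index and prompts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "deepseek-ai/deepseek-llm-7b-chat"  # assumed checkpoint name
LAYER_IDX = 20          # hypothetical residual-stream layer to steer
STEERING_STRENGTH = 80  # strength in the range the summary reports

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def mean_hidden_state(text: str, layer: int) -> torch.Tensor:
    """Average the residual-stream activations after `layer` over all token positions."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer i's output is index i + 1.
    return out.hidden_states[layer + 1][0].mean(dim=0)

# Concept vector by mean-subtraction: concept activation minus the average
# activation over a baseline of random words, normalized to a unit vector.
concept_act = mean_hidden_state("Think about the concept of the ocean.", LAYER_IDX)
baseline_words = ["table", "cloud", "river", "engine", "music"]  # post uses 100 random words
baseline_act = torch.stack(
    [mean_hidden_state(w, LAYER_IDX) for w in baseline_words]
).mean(dim=0)
concept_vec = concept_act - baseline_act
concept_vec = concept_vec / concept_vec.norm()

# Forward hook that adds the scaled concept vector to the residual stream
# at every token position of the chosen layer.
def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + STEERING_STRENGTH * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER_IDX].register_forward_hook(steering_hook)
try:
    prompt = "Do you notice anything unusual about your internal state right now?"
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=80)
    print(tokenizer.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook; strength 0 (or no hook) is the control condition
```

Running the same prompt with the hook removed (or with STEERING_STRENGTH set to 0) gives the zero-strength control against which the steered responses are compared.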