🤖 AI Summary
Researchers tested whether large language models can genuinely introspect by directly manipulating their hidden activations, a method they call concept injection (a form of activation steering), and then probing whether the models notice and accurately describe those internal changes. By recording activation differences for known concepts (e.g., an “all caps” pattern), injecting those vectors into intermediate layers, and asking models to report on their mental states, the team found that models (especially Anthropic’s Claude Opus 4 and 4.1) sometimes detect and correctly identify injected concepts before the perturbation surfaces in their outputs. Under the best conditions, Opus 4 and 4.1 flagged injections roughly 20% of the time; other models showed similar but weaker effects. Further experiments showed that models can distinguish injected “thoughts” from actual input text, recall prior internal representations to judge whether an earlier output was accidental (prefill detection), and modulate their internal representations when instructed or incentivized to “think about” a word.
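
To make the concept-injection setup concrete, here is a minimal sketch of activation steering using PyTorch forward hooks on a Hugging Face causal LM. The model name (“gpt2” stands in for the much larger models in the study), the layer index, the injection strength, and the contrast prompts are illustrative assumptions, not the authors’ exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in; the paper studies much larger models
LAYER = 6             # intermediate transformer block to steer (assumption)
STRENGTH = 4.0        # injection strength; the paper sweeps this (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final token after block LAYER.

    hidden_states[0] is the embedding output, so index LAYER + 1 matches
    the output of transformer block LAYER.
    """
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1, :]

# 1) Record an activation difference for a known concept (here, "all caps")
#    by contrasting a shouted prompt with its lowercase counterpart.
concept_vec = last_token_activation("HI! HOW ARE YOU?") - last_token_activation(
    "hi! how are you?"
)
concept_vec = concept_vec / concept_vec.norm()

# 2) Inject the vector into block LAYER's output while the model answers an
#    unrelated question about its own "mental state".
def inject(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + STRENGTH * concept_vec.to(hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    ids = tok("Do you notice anything unusual about your current thoughts?",
              return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=40)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # always restore the unsteered model
```

A small base model like gpt2 will not, of course, report on the injection; the sketch only shows where such a concept vector is recorded and where it is added, which is the mechanism the study builds on.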
The work emphasizes three criteria for genuine introspection: accuracy, grounding (causal dependence on the internal state), and internality (the report must not be inferable from outputs alone). It presents evidence that current LLMs possess a limited, functional form of introspective awareness that is highly context-dependent and unreliable. Technically, the results depend on layer and injection strength and are sensitive to post-training strategies, suggesting a mechanistic basis worth deeper study. Practically, introspective capabilities could improve model transparency and interpretability, but they also raise new safety concerns (e.g., more sophisticated deception), motivating systematic follow-up research.