Signs of introspection in large language models (www.anthropic.com)

🤖 AI Summary
Anthropic reports experimental evidence that its Claude Opus models show a limited, unreliable form of introspection: they can sometimes detect, and even control, internal neural representations of concepts. Using an interpretability method called “concept injection,” researchers identified activation vectors tied to known concepts (e.g., “ALL CAPS”) and injected those vectors into the model's activations while it processed unrelated prompts. In successful trials the model noticed the injected pattern internally, often before producing any related text, and labeled it (e.g., as “shouting”). A striking retroactive test injected a “bread” representation into earlier activations after the model had been forced to produce that word; the model then endorsed the word as intentional and generated post hoc justifications, implying it consulted prior internal activity rather than merely re-reading its own output.

Opus 4 and 4.1 outperformed smaller models, but even the best showed awareness only ~20% of the time and often hallucinated or failed outright when the injection strength fell outside the “sweet spot.”

Technically, the work ties specific activation vectors to interpretable concepts and demonstrates both monitoring and deliberate modulation of those vectors (via direct instruction or incentive), suggesting models can intentionally alter their internal states. The implications are twofold: if validated and made more reliable, introspection could become a practical transparency and debugging tool; but current limits, failure modes, and risks (misrepresentation, selective reporting, and unconscious processes the model misses) mean these findings are preliminary and demand cautious follow-up.
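To make the “concept injection” idea concrete, here is a minimal sketch of activation steering in an open model. Everything specific is an assumption for illustration: the stand-in model (GPT-2 rather than Claude), the injection layer, the scale, the prompts, and the crude mean-difference vector construction; Anthropic derives its concept vectors with its own interpretability tooling, which is not shown here.

```python
# Illustrative sketch of concept injection via activation steering.
# Assumptions (not from the article): GPT-2 as a stand-in model, layer 6,
# scale 8.0, and a difference-of-activations concept vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6    # hypothetical injection layer
SCALE = 8.0  # injection strength; the "sweet spot" must be tuned empirically


def last_hidden(prompt: str, layer: int) -> torch.Tensor:
    """Hidden state of the final token at the given layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :]


# Crude "ALL CAPS / shouting" concept vector: difference between activations
# on caps-heavy text and its lowercase counterpart.
concept_vec = last_hidden("HEY! STOP SHOUTING AT ME!", LAYER) - last_hidden(
    "hey! stop shouting at me!", LAYER
)
concept_vec = concept_vec / concept_vec.norm()


def injection_hook(module, inputs, output):
    # Add the scaled concept vector to the residual stream at every position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden


# Inject during generation on an unrelated prompt, then remove the hook.
handle = model.transformer.h[LAYER].register_forward_hook(injection_hook)
try:
    ids = tok(
        "Describe anything unusual you notice about your own processing:",
        return_tensors="pt",
    )
    with torch.no_grad():
        out_ids = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()
```

The forward hook mirrors the article's core manipulation: the prompt is untouched, and the concept enters only through the model's internal activations, so any report of “shouting” would have to come from monitoring those activations rather than from the text itself.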