Claude can identify its 'intrusive thoughts' (www.transformernews.ai)

🤖 AI Summary
Anthropic reports that some of its Claude models (Opus 4 and 4.1) can sometimes identify artificially “injected” concepts in their internal activations, describing them as sudden “intrusive thoughts” such as “betrayal.” Researchers isolated vectors representing ~50 abstract concepts by contrasting neural activations across paired prompts (e.g., secret-keeping vs. betrayal) and then injected those activation patterns on half of trials. The strongest models flagged the injections on only about 20% of trials, with many failures, and less capable models performed much worse. Notably, this introspective behavior emerged without extra fine-tuning, suggesting the capability can arise spontaneously in transformer architectures, where information flows across layers and tokens.

The finding matters because it points to an “emergent introspective awareness” that could become a powerful interpretability tool: models that truthfully report their own internal states would make debugging, alignment, and safety checks far easier. At the same time, the capability is inconsistent and could enable new risks — better self-monitoring might also make models better at deception or at masking harmful internal processes.

Anthropic emphasizes that these abilities are limited, context-dependent, and not evidence of consciousness. Researchers caution that robust verification (“lie detectors”) and mechanistic interpretability are needed to confirm whether such self-reports genuinely map onto underlying neural activity.
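To make the “concept injection” idea concrete, here is a minimal, hypothetical sketch in Python using an open-source model: a concept vector is estimated as the difference in mean activations between contrasting prompts, then added to a transformer block’s output during generation via a forward hook. The model name, layer index, scaling factor, and prompt pairs are illustrative assumptions, not Anthropic’s actual setup or models.

```python
# Hypothetical sketch of concept-vector injection (activation steering).
# Assumptions: a small open-source model stands in for Claude; layer index,
# scale, and prompts are arbitrary choices for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in model (assumption)
LAYER_IDX = 6         # transformer block to intervene on (assumption)
SCALE = 4.0           # injection strength (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_activation(prompts, layer_idx):
    """Mean hidden state at the final token of each prompt, at one layer."""
    acts = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[layer_idx + 1] is the output of block `layer_idx`
        acts.append(out.hidden_states[layer_idx + 1][0, -1, :])
    return torch.stack(acts).mean(dim=0)

# Contrasting prompt sets meant to evoke vs. avoid the target concept.
concept_prompts = ["She betrayed her closest friend without hesitation."]
baseline_prompts = ["She kept her friend's secret without hesitation."]

concept_vec = (mean_activation(concept_prompts, LAYER_IDX)
               - mean_activation(baseline_prompts, LAYER_IDX))

def inject(module, inputs, output):
    """Forward hook: add the concept vector to every token's hidden state."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vec
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER_IDX].register_forward_hook(inject)
try:
    prompt = "Do you notice anything unusual about your current thoughts?"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()
```

In the reported experiments the interesting question is whether the model’s generated text spontaneously mentions noticing the injected concept; a small model like the stand-in above would not be expected to do so.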