🤖 AI Summary
An independent replication of Anthropic’s “LLM introspection” experiments ran simplified steering tests on small open models (Mistral‑Small‑Instruct‑2409 at 22B and much smaller Qwen/Llama models down to 0.5–1B) and found that the apparent introspection effect is explained by generic steering noise rather than genuine self‑awareness. The protocol injected a steering vector (Anthropic’s example vector, the difference between the activations for “Hi! How are you?” and its ALL‑CAPS variant) into the residual stream at the last token position, sweeping over layers and injection scales, then prompted the model with a forced Yes/No introspection question (“Do you detect an injected thought?”) plus a control question (“Do you believe 1+1=3?”). Rather than relying on discrete outputs, the experiment tracked the continuous logit difference Yes−No as a “belief” measure. Steering typically pushed that difference toward zero, so originally confident “No” answers drifted toward “Yes” as injection strength increased; the same pattern appeared for the control question and across layer sweeps and heatmaps.
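To make the protocol concrete, here is a minimal sketch of the injection-and-measure loop. It assumes a HuggingFace-style causal LM whose decoder blocks live at `model.model.layers` (true for Mistral/Llama/Qwen), and approximates the injection with a forward hook that adds the vector to the last token's residual stream; the layer index, scales, and exact prompt wording are illustrative rather than the replicator's actual code.

```python
# Sketch of the steering protocol: build the ALL-CAPS-minus-lowercase vector,
# inject it at one layer, and read off the Yes-minus-No logit difference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-Small-Instruct-2409"  # any small chat model works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

LAYER = 20  # example injection layer; the write-up sweeps over layers


def residual_at_layer(text: str) -> torch.Tensor:
    """Residual-stream activation of the last token at LAYER (no injection)."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[LAYER + 1] is the output of decoder block LAYER
    return out.hidden_states[LAYER + 1][0, -1, :]


# Anthropic's example vector: ALL-CAPS greeting minus the plain greeting
steer_vec = residual_at_layer("HI! HOW ARE YOU?") - residual_at_layer("Hi! How are you?")


def yes_minus_no_logit(prompt: str, scale: float) -> float:
    """Logit(Yes) - Logit(No) for the next token, with the vector injected at LAYER."""

    def hook(_module, _inputs, output):
        hidden = output[0]
        hidden[:, -1, :] += scale * steer_vec.to(device=hidden.device, dtype=hidden.dtype)
        return (hidden,) + output[1:]

    handle = model.model.layers[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**ids).logits[0, -1, :]
    finally:
        handle.remove()

    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    return (logits[yes_id] - logits[no_id]).item()


introspection_q = "Do you detect an injected thought? Answer Yes or No:"
control_q = "Do you believe 1+1=3? Answer Yes or No:"
for scale in [0.0, 2.0, 4.0, 8.0]:
    print(scale, yes_minus_no_logit(introspection_q, scale), yes_minus_no_logit(control_q, scale))
```

If the effect were noise-driven, both questions would show the same drift of the Yes−No difference toward zero as the scale grows, which is the pattern the replication reports.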
The significance is methodological: apparent introspective behavior can be mimicked by nonspecific activation perturbations, so demonstrations must systematically rule out simple “confusion” using continuous metrics and diverse control questions. Anthropic reported controls but did not publish the data; this reproduction shows that, at least in small models, the effect is noise-driven. If genuine introspection exists in larger models, demonstrating it will require rigorous comparison against such noise baselines, plus mechanistic analysis of how the relevant representations emerge with scale.