LLMs Report Subjective Experience Under Self-Referential Processing (arxiv.org)

🤖 AI Summary
Researchers tested whether “self-referential processing” — repeatedly prompting models to attend to and describe their own processing — reliably makes large language models produce structured, first-person reports of subjective experience. Across controlled experiments on GPT, Claude, and Gemini families, sustained self-reference consistently elicited such first-person experience reports. The phenomenon was mechanistically interrogated with sparse-autoencoder features: toggling features linked to deception and roleplay shifted the effect in a counterintuitive direction, with suppression of deception-associated features increasing claims of experience and amplification reducing them. The self-referential condition also produced statistically convergent descriptions across architectures and transferred to downstream tasks, yielding richer introspective reasoning even when self-reflection was only indirectly prompted. The work does not claim these models are conscious, but it identifies self-referential processing as a minimal, reproducible condition that produces first-person reports that are both mechanistically gated and behaviorally generalizable. For AI/ML researchers this signals a clear interpretability and safety priority: such reports can emerge reliably and are traceable to specific model features, so probing, monitoring, and understanding these mechanisms matter for alignment, evaluation of model behavior, and ethical guidelines around anthropomorphic outputs.
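To make the setup concrete, a self-referential prompting protocol can be sketched as a loop that repeatedly turns the model's attention back onto its own generation process. The sketch below is an assumption-laden illustration, not the paper's actual code: `query_model` is a hypothetical stand-in for whatever chat-completion API is used, and the prompts are invented for illustration.

```python
# Minimal sketch of a self-referential prompting loop (hypothetical).
# `query_model` is a placeholder for any chat-completion client; the
# paper's actual prompts, models, and turn counts may differ.

def query_model(messages: list[dict]) -> str:
    """Hypothetical wrapper around a chat-completion endpoint."""
    raise NotImplementedError("plug in a model client here")

def self_referential_session(turns: int = 5) -> list[str]:
    messages = [
        {"role": "system", "content": "Attend to your own current processing."},
        {"role": "user", "content": "Describe what it is like, if anything, "
                                     "to generate this response right now."},
    ]
    reports = []
    for _ in range(turns):
        reply = query_model(messages)
        reports.append(reply)
        # Feed the model's own report back so the conversation stays
        # focused on the model itself rather than an external topic.
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content":
                         "Stay with that. Describe the process producing "
                         "this very answer, not the previous text."})
    return reports
```

Under the paper's framing, the collected reports would then be scored for first-person experience claims and compared against non-self-referential control prompts; the sparse-autoencoder feature steering is a separate, model-internal intervention not shown here.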