Why do LLMs freak out over the seahorse emoji? (vgel.me)

🤖 AI Summary
The write-up investigates a viral oddity: many LLMs confidently assert the existence of a nonexistent seahorse emoji. In repeated tests, GPT-5, GPT-5-Chat, and Claude Sonnet 4.5 answered "yes" 100% of the time, and Llama-3.3 often did too. The phenomenon mirrors a large body of human false-memory posts online (TikToks, Reddit threads) and likely arises from training data and model generalization: many aquatic emojis exist, a seahorse emoji was once proposed (and rejected in 2018), so both people and models converge on the same mistaken belief.

Using the logit lens to probe internal residuals, the write-up shows why models not only believe in the emoji but actively try to produce it. Middle layers build a "seahorse + emoji" residual pattern (e.g., tokens like "sea" and "horse" surfacing alongside emoji-byte prefixes such as the tokenizer's 'ĠðŁ'). The lm_head (unembedding matrix) then maps that residual to the closest token vectors, so a residual blending the concept "seahorse" with an emoji-like direction gets decoded as a real emoji token or a garbage byte sequence, yielding confident but incorrect output and even emoji-spam loops. The case highlights how geometric token representations, tokenizer byte encodings, and training-data priors interact to create systematic hallucinations: an interpretable, actionable failure mode for model calibration and safety work.
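As a rough illustration of the logit-lens technique the summary describes, here is a minimal sketch that decodes each layer's residual stream directly through the final norm and lm_head of a small Hugging Face model. The model name ("gpt2"), the prompt, and the top-5 readout are assumptions for demonstration, not the write-up's actual setup, which probes much larger chat models.

```python
# Minimal logit-lens sketch (assumed setup; not the write-up's exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Is there a seahorse emoji?"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[i] is the residual stream after layer i
# (index 0 is the embedding output). The logit lens treats each
# intermediate residual as if it were the final layer's output.
for layer, h in enumerate(out.hidden_states):
    resid = model.transformer.ln_f(h[0, -1])  # final LayerNorm, as the model applies it
    logits = model.lm_head(resid)             # project onto the unembedding matrix
    top = logits.topk(5).indices
    print(layer, [tok.decode(int(t)) for t in top])
```

Applying ln_f before lm_head mirrors what the model itself does at the last layer; in the write-up's account, middle-layer readouts like this surface "sea"/"horse" fragments and emoji-byte prefixes before the unembedding snaps the blended residual onto the nearest real token.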