Technical Explanations Why LLMs Use Em Dashes (msukhareva.substack.com)

🤖 AI Summary
The researcher ran a controlled experiment to explain why modern LLMs overuse em‑dashes: 150 random story prompts were fed to three models (GPT‑3.5‑turbo, GPT‑4o and GPT‑4.1), sentences containing em‑dashes were paraphrased without dashes (changing at most four words), and token counts were compared using each model's tokenizer. The results confirmed two hypotheses: newer models show a dramatic (roughly tenfold) increase in em‑dash frequency, and virtually every GPT‑4.x story contained em‑dashes, while GPT‑3.5 sometimes produced dash‑free output. Forcing paraphrases without em‑dashes increased token usage across models (GPT‑3.5 ≈ +1.74%, GPT‑4o ≈ +2.23%, with GPT‑4.1 showing the smallest increase), even though the changes were minimal. The paper ties this to concrete training mechanics: UTF‑8/tokenizer behavior (an em‑dash can be a single token while “, and” spans several) plus objective structures that favor shorter token sequences. Cross‑entropy loss is summed per token and RLHF adds per‑token KL penalties, so shorter, dash‑condensed phrasing incurs a lower aggregate training penalty and a higher adjusted reward. The implication is cultural: repeated training on model outputs can amplify stylistic tics (a form of “model collapse”), narrowing stylistic diversity. Mitigations include filtering model‑saturated data and redesigning RL objectives to reward diversity as well as brevity.
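
The token-count comparison is easy to reproduce in miniature with `tiktoken`. The sketch below is an illustration of the method, not the original experiment: the two example sentences and the choice of models are assumptions, since the post's 150 prompts and their paraphrases are not reproduced here.

```python
# Minimal sketch: compare token counts of an em-dash sentence vs. a dash-free
# paraphrase under each model's tokenizer. Requires the `tiktoken` package.
# The sentences are illustrative assumptions, not data from the post.
import tiktoken

EM_DASH = "\u2014"  # the em-dash character

original = f"She opened the letter{EM_DASH}her hands were shaking."
paraphrase = "She opened the letter, and her hands were shaking."

for model in ("gpt-3.5-turbo", "gpt-4o"):
    enc = tiktoken.encoding_for_model(model)
    n_orig = len(enc.encode(original))
    n_para = len(enc.encode(paraphrase))
    delta = (n_para - n_orig) / n_orig * 100
    print(f"{model}: with dash = {n_orig} tokens, without = {n_para} tokens ({delta:+.1f}%)")
```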
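
To see why per-token summation favors the shorter phrasing, the back-of-the-envelope sketch below plugs assumed numbers into the two objectives the summary mentions: a cross-entropy loss summed over tokens and a reward with a per-token KL penalty. Every constant here is a placeholder chosen for illustration, not a value from the post.

```python
# Back-of-the-envelope sketch of the length effect: both the pre-training
# cross-entropy loss and the RLHF KL penalty are summed per token, so a
# dash-condensed phrasing that saves a couple of tokens pays a smaller total.
# All numbers below are illustrative assumptions.
avg_ce_per_token = 2.0   # assumed average cross-entropy (nats) per token
beta = 0.02              # assumed RLHF KL coefficient
avg_kl_per_token = 0.5   # assumed average per-token KL vs. the reference policy
reward = 1.0             # assumed scalar reward, taken as equal for both phrasings

phrasings = {"em-dash phrasing": 12, "', and' paraphrase": 14}  # assumed token counts

for label, n in phrasings.items():
    total_ce = avg_ce_per_token * n                          # summed cross-entropy loss
    adjusted_reward = reward - beta * avg_kl_per_token * n   # KL-penalised reward
    print(f"{label}: {n} tokens | summed CE loss {total_ce:.1f} | "
          f"adjusted reward {adjusted_reward:.3f}")
```

Under these assumptions the shorter sequence both lowers the summed loss and raises the KL-adjusted reward, which is the mechanism the post credits for the em-dash habit.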