🤖 AI Summary
AI writing’s signature em-dash is surprisingly persistent, and this analysis probes why large language models use it so heavily. Common explanations (simple mimicry of web text, token efficiency, or a generic “safe” continuation strategy) don’t fully fit the facts: attempts to prompt models away from em-dashes often fail. The piece also rules out a single dialectal source: while RLHF work in Kenya and Nigeria likely shaped some lexical preferences (e.g., “delve”), a Nigerian English corpus shows a lower em-dash rate (0.022%) than historical averages for English (≈0.25–0.275%), so that route doesn’t explain em-dash prevalence either.
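For scale, the quoted figures are corpus frequencies, but the article does not spell out the exact denominator. The sketch below simply assumes the rate is em-dashes as a percentage of all characters and reads a hypothetical `corpus.txt` file; it is an illustration of the measurement, not the article’s actual method.

```python
# Minimal sketch of the kind of corpus measurement behind the quoted rates.
# Assumptions (not from the article): the rate is em-dashes as a percentage
# of characters, and `corpus.txt` is a hypothetical plain-text input file.
from pathlib import Path

EM_DASH = "\u2014"  # the em-dash character itself


def em_dash_rate(text: str) -> float:
    """Return em-dashes as a percentage of all characters in `text`."""
    if not text:
        return 0.0
    return 100.0 * text.count(EM_DASH) / len(text)


if __name__ == "__main__":
    text = Path("corpus.txt").read_text(encoding="utf-8")
    print(f"em-dash rate: {em_dash_rate(text):.4f}% of characters")
```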
The most plausible technical hypothesis is a data-shift effect: between GPT-3.5 and GPT-4o, training data appears to have incorporated far more scanned print books, many from the late 19th and early 20th centuries when em-dash use peaked, so models learned punctuation patterns that are out of step with contemporary prose. Empirical hints: GPT-4o uses ~10× more em-dashes than GPT-3.5, classic works (Moby-Dick) are densely dashed, and other labs’ models show similar behavior. Implications: punctuation can be a persistent artifact of training-era corpus biases, complicating style control, forensic detection, and RLHF tuning. The explanation remains partly speculative (synthetic data, RLHF preferences, and undisclosed corpus choices could also matter), so confirmation from insiders or corpus audits would be decisive.
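The “~10×” comparison and the observation about densely dashed classics both reduce to normalized em-dash counts. A minimal sketch of such an audit, assuming two hypothetical local sample files and also counting the typewriter-style `--` that older digitized texts often use in place of the Unicode character:

```python
# Sketch: compare em-dash density across two hypothetical text samples,
# normalized per 10,000 words. File names are placeholders, not from the article.
import re
from pathlib import Path

# Match Unicode em-dashes plus the "--" typewriter convention sometimes found
# in older digitized texts (scanned print books may lack the Unicode character).
DASH_PATTERN = re.compile(r"\u2014|(?<!-)--(?!-)")


def dashes_per_10k_words(path: Path) -> float:
    """Return em-dash occurrences per 10,000 words in the file at `path`."""
    text = path.read_text(encoding="utf-8")
    words = len(text.split())
    if words == 0:
        return 0.0
    return 10_000 * len(DASH_PATTERN.findall(text)) / words


if __name__ == "__main__":
    for name in ("gpt35_samples.txt", "gpt4o_samples.txt"):
        rate = dashes_per_10k_words(Path(name))
        print(f"{name}: {rate:.1f} em-dashes per 10k words")
```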