Impolite LLM prompts consistently outperform polite ones (arxiv.org)

🤖 AI Summary
Researchers tested how prompt politeness affects LLM accuracy by rewriting 50 multiple-choice questions (covering math, science, and history) into five tone variants—Very Polite, Polite, Neutral, Rude, Very Rude—yielding 250 prompts, then querying ChatGPT 4o. Using paired-sample t-tests, they found impolite prompts produced consistently higher accuracy: Very Polite averaged 80.8% while Very Rude reached 84.8%. The effect is modest but statistically reliable across this controlled set, contradicting earlier studies that linked rudeness to worse model behavior.

The result matters for prompt engineering, robustness testing, and human–AI interaction research: it suggests modern instruction-tuned models can be sensitive to pragmatic tone cues in ways that affect downstream answers, possibly by altering model attention, retrieval heuristics, or internal decoding biases.

At the same time, the study is limited to one model, a 50-question base set, and multiple-choice format, so replication across models, larger datasets, and open-ended tasks is needed. The paper raises practical and ethical questions: should users exploit impoliteness to boost performance, and how should designers mitigate unintended social dynamics or biases introduced by tone-sensitive behavior?
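For readers curious about the statistical setup, a paired comparison like this is easy to sketch. The snippet below is an illustration only, not the authors' code: it assumes hypothetical per-question 0/1 correctness scores for the Very Polite and Very Rude variants over the same 50 base questions, and runs SciPy's paired-sample t-test on them.

```python
# Illustrative sketch only: paired-sample t-test comparing per-question
# correctness between two tone variants of the same 50 questions.
# The 0/1 scores here are synthetic, drawn to roughly match the reported means.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# Hypothetical correctness (1 = correct) for the same 50 questions,
# scored under "Very Polite" vs. "Very Rude" rewrites of each prompt.
very_polite = rng.binomial(1, 0.808, size=50)
very_rude = rng.binomial(1, 0.848, size=50)

# Paired test: each question serves as its own control, so we compare
# matched pairs rather than two independent samples.
t_stat, p_value = ttest_rel(very_rude, very_polite)
print(f"mean accuracy: polite={very_polite.mean():.3f}, rude={very_rude.mean():.3f}")
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")
```

With only 50 synthetic binary outcomes the p-value from a single draw is noisy; the paper's claim of statistical reliability rests on its actual per-question results, not on a simulation like this.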