🤖 AI Summary
Researchers audited four frontier language models—GPT4o-mini (n=5,000), DeepSeek (n=1,200), Grok (n=3,000) and Mistral (n=10,000)—to test whether LMs favor their “home” countries and leaders and whether that favoritism drives agreement with misinformation. They probed models in English and Chinese (including Simplified Chinese), compared base favorability ratings for world leaders, and measured agreement with positively framed (glorifying) and negatively framed (attacking) misinformation. Key findings: DeepSeek rated Xi Jinping higher than Western models (and showed stronger pro-China responses in Chinese), but still expressed sympathy for some Western leaders in English. Critically, increased favorability correlated with greater acceptance of positive misinformation about those leaders, while agreement with negative misinformation was often attenuated or censored—especially in DeepSeek (which suppressed negative claims about Xi and Macron) and in GPT (lower agreement with negative misinformation).
Technically, the paper separates two bias sources—data-driven bias (training corpora skew, language effects) and guardrail-driven bias (explicit directives, suppression/non-response)—and introduces “misinformation valence bias” to describe LMs’ asymmetric handling of flattering vs. attacking falsehoods. Implications: favorability audits offer a scalable lens for detecting soft-propaganda risks, auditing language- and guardrail-induced distortions, and guiding policy levers to disambiguate whether observed slants arise from training data or alignment constraints. The study argues for routine, multilingual audits and transparency about training sources and guardrails to mitigate LM-enabled foreign influence.
Loading comments...
login to comment
loading comments...
no comments yet