🤖 AI Summary
Researchers from the University of Zurich, the University of Amsterdam, Duke, and NYU released a study showing that AI chat replies on social media remain reliably detectable because their emotional tone is unnaturally polite and low in spontaneous negativity. Using a "computational Turing test," the team trained automated classifiers to distinguish machine-written from human-written replies on Twitter/X, Bluesky, and Reddit, achieving 70–80% accuracy across nine open-weight models (including Llama 3.1 variants, Mistral 7B, Qwen 2.5 7B, Gemma 3 4B, DeepSeek-R1, and Apertus-8B). Across all three platforms the models produced consistently lower toxicity and less spontaneous emotional expression than authentic human responses, making affective cues the strongest tell.
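To make the "computational Turing test" idea concrete, here is a minimal sketch: a bag-of-words classifier trained to separate human from LLM replies and scored on held-out accuracy. The sample data, features, and model below are illustrative assumptions, not the study's actual pipeline.

```python
# Minimal sketch of a "computational Turing test": train a binary
# classifier to tell human replies from LLM replies, then report
# held-out accuracy. Data and features here are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical labeled replies: 1 = human-authored, 0 = LLM-generated.
texts = [
    "ugh this thread is a dumpster fire lol",                  # human
    "Thank you for sharing this thoughtful view!",             # LLM
    "nah that's just wrong, read the article",                 # human
    "I appreciate the nuance in your argument here.",          # LLM
    "who even asked??",                                        # human
    "That is a great question, and I am happy to help.",       # LLM
    "this app gets worse every update smh",                    # human
    "It is wonderful to see such engagement on this topic!",   # LLM
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Word/bigram TF-IDF features feeding a logistic regression detector.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

In the study's setting the same idea is applied at scale, with the reported 70–80% accuracy driven largely by the affective cues described above rather than by any single surface feature.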
Technically, the authors tested a range of optimization strategies (few-shot prompting, context retrieval, calibration, and fine-tuning) to close structural gaps such as sentence length and word choice, but the emotional signatures persisted. That implies affective modeling remains a weak spot for current LLMs and a stable signal for automated detection. For the AI/ML community this highlights two points: (1) evaluation should include affective and pragmatic metrics, not just surface fidelity, and (2) developers seeking undetectable agents must address emotion generation (or risk easier detection), while platforms can leverage affective classifiers to improve moderation and provenance tools.
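As one concrete instance of point (1), an affective metric can be as simple as comparing sentiment distributions between human and model replies. The sketch below uses NLTK's VADER analyzer to contrast mean negative sentiment; the metric choice and sample texts are assumptions for illustration, not the paper's exact measures.

```python
# Hedged sketch of an affective evaluation metric: compare mean VADER
# negativity between human-authored and model-generated replies. A large
# gap (humans more negative) mirrors the detection signal in the study.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon fetch
sia = SentimentIntensityAnalyzer()

# Hypothetical reply samples for each side.
human_replies = ["ugh this thread is a dumpster fire lol", "who even asked??"]
model_replies = [
    "Thank you for sharing this thoughtful view!",
    "It is wonderful to see such engagement on this topic!",
]

def mean_negativity(replies):
    """Average VADER 'neg' score: a rough proxy for spontaneous negativity."""
    return sum(sia.polarity_scores(r)["neg"] for r in replies) / len(replies)

print("human neg:", mean_negativity(human_replies))
print("model neg:", mean_negativity(model_replies))
```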