Computational Turing test shows systematic difference between human, AI language (arxiv.org)

🤖 AI Summary
Researchers introduce a "computational Turing test": a scalable validation framework that combines aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to quantify how closely LLM-generated text matches real human interactions. They systematically benchmark nine open-weight LLMs across five calibration strategies (including fine-tuning, stylistic prompting, and context retrieval) on social-media dialogues from X, Bluesky, and Reddit. The framework is designed to go beyond noisy human-judgment tests and provide reproducible, feature-level diagnostics of model realism.

Key findings challenge common assumptions: even after calibration, LLM outputs remain detectably different from human language, especially in affective tone and emotional expression; instruction-tuned models performed worse at mimicking humans than their base counterparts; and simply scaling model size did not improve "human-likeness." Importantly, the authors reveal a trade-off: optimizing for human-like surface features often reduces semantic fidelity.

For AI/ML researchers and social scientists, this implies that LLMs are not yet reliable stand-ins for human agents in behavioral simulations, and that careful, metric-based validation, like the proposed computational Turing test, is essential when calibrating models for downstream social analyses.
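To make the two aggregate metrics concrete, below is a minimal sketch of how a detectability score and a semantic-similarity score could be computed for paired human and LLM replies. This is not the paper's pipeline: the encoder choice (`all-MiniLM-L6-v2`), the logistic-regression detector, and the placeholder data are illustrative assumptions standing in for the BERT-based classifier the summary describes.

```python
# Sketch (assumptions noted above): detectability = can a classifier tell
# human from LLM text? semantic similarity = how close are paired replies?
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics.pairwise import cosine_similarity

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative stand-in encoder

# Placeholder data: human and LLM replies to the same posts, paired by index.
human_replies = ["honestly this made my day", "nah that's not how it works", "lol fair point"]
llm_replies = ["This is truly wonderful news!", "I respectfully disagree with that claim.",
               "That is a fair and valid point."]

# Detectability: ROC AUC of a classifier separating human vs. LLM embeddings.
# 0.5 ~ indistinguishable from humans, 1.0 ~ trivially detectable.
X = encoder.encode(human_replies + llm_replies)
y = np.array([0] * len(human_replies) + [1] * len(llm_replies))
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=3, scoring="roc_auc").mean()

# Semantic similarity: cosine similarity between each human reply and the
# LLM reply generated for the same context.
h_emb = encoder.encode(human_replies)
m_emb = encoder.encode(llm_replies)
sim = np.diag(cosine_similarity(h_emb, m_emb)).mean()

print(f"detectability (ROC AUC): {auc:.2f}")
print(f"mean paired semantic similarity: {sim:.2f}")
```

The trade-off the authors report would surface here as the two numbers moving in opposite directions: calibration that pushes the detectability score toward 0.5 can simultaneously pull the paired semantic similarity down.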