Generalization Bias in Large Language Model Summarization of Scientific Research (royalsocietypublishing.org)

🤖 AI Summary
Researchers systematically evaluated "generalization bias" in 10 leading LLMs (e.g., ChatGPT-4o/4.5, LLaMA 3.3 70B, Claude 3.7 Sonnet, DeepSeek) by comparing 4,900 model-generated summaries against their original scientific texts (100 multidisciplinary abstracts, 100 medical abstracts, and, for some models, 100 full clinical articles) and against expert NEJM Journal Watch summaries. Each summary's result claims were coded as "restricted" (quantified, past-tense, descriptive) or "generalized" (generic or unquantified, present-tense, or action-guiding).

Even with accuracy-focused prompts, many LLMs produced broader conclusions than the originals warranted: DeepSeek, ChatGPT-4o, and LLaMA 3.3 overgeneralized in 26–73% of cases, and LLM summaries were nearly five times more likely than human summaries to contain broad generalizations (odds ratio 4.85, 95% CI [3.06, 7.70], p < 0.001). Surprisingly, newer models often performed worse on generalization fidelity than earlier ones.

The study used logistic regression to quantify shifts from restricted to generalized conclusions and highlights concrete risks, especially for clinical research, where overstated recommendations could influence policy or care. The authors recommend mitigations such as lowering sampling temperature, benchmarking models specifically for generalization accuracy, and tailored prompting, while noting that some generalizations are communicatively useful when evidentially warranted. The work calls for systematic evaluation and engineering interventions before LLMs are relied upon for high-stakes scientific summarization.
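The headline odds ratio can be read as: the odds that an LLM summary contains a broad generalization are about 4.85 times the odds for an expert human summary. As a rough, hedged illustration (not the paper's actual analysis, which fits logistic regression models to the coded summaries), the sketch below shows how an odds ratio and a Wald 95% confidence interval fall out of a simple 2x2 table. The counts are invented for illustration only and will not reproduce the reported 4.85 [3.06, 7.70].

```python
import math

# Hypothetical 2x2 table (illustrative counts only, NOT from the paper):
# rows = summary source, columns = whether the conclusion was generalized.
llm_generalized, llm_restricted = 320, 480        # assumed counts
human_generalized, human_restricted = 70, 365     # assumed counts

# Odds ratio: odds of a generalized conclusion in LLM vs. human summaries.
odds_ratio = (llm_generalized / llm_restricted) / (human_generalized / human_restricted)

# Wald 95% CI, computed on the log-odds-ratio scale (z = 1.96 for 95% coverage).
se_log_or = math.sqrt(
    1 / llm_generalized + 1 / llm_restricted +
    1 / human_generalized + 1 / human_restricted
)
log_or = math.log(odds_ratio)
ci_low = math.exp(log_or - 1.96 * se_log_or)
ci_high = math.exp(log_or + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

A full analysis like the paper's would instead regress the generalized/restricted outcome on summary source (and any covariates) with logistic regression, from which the same kind of odds ratio and interval are reported.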