🤖 AI Summary
The AAAS’s SciPak team ran a year-long, informal study (Dec 2023–Dec 2024) testing ChatGPT’s ability to produce the short “news brief” summaries their journalists write for Science and EurekAlert. They fed the Plus versions of the latest public GPT models (spanning the GPT-4 and GPT-4o eras) up to two challenging papers per week (64 papers total), using three prompts of varying specificity. The outputs, evaluated by the same SciPak writers who had produced the original briefs, “passably emulated” SciPak structure but frequently sacrificed accuracy for simplicity and required rigorous fact-checking. The team concluded these LLM outputs are potentially helpful as drafting aids but not ready to replace human writers for publication-ready summaries.
For the AI/ML community, the study underscores persistent limitations of current LLMs on technical summarization: they can capture format and high-level framing but struggle with nuance, methodological detail, and factual precision, especially for jargon-heavy, controversial, human-subject, or non-standard papers. The evaluation design (expert journalists assessing the outputs) highlights both practical utility and social risks: reliance on LLM drafts could introduce subtle errors and bias unless humans remain committed to verification. Practically, the result reinforces the need for toolchains that combine model-generated drafts with rigorous automated fact-checking, provenance tracing, and expert oversight before deploying LLMs in science communication.
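As a purely illustrative sketch of that kind of toolchain (not anything SciPak built), the snippet below shows a minimal draft-review gate: an LLM draft is scanned for sentences likely to contain checkable claims, and publication is blocked until a human editor clears them. Every name here (DraftReview, flag_claims_for_checking, the risky-term list) is hypothetical, and the claim-flagging heuristic is a deliberate placeholder for real fact-checking and provenance tooling.

```python
from dataclasses import dataclass, field


@dataclass
class DraftReview:
    """Tracks an LLM-generated draft through verification before publication."""
    draft: str
    flagged_claims: list[str] = field(default_factory=list)
    human_approved: bool = False


def flag_claims_for_checking(draft: str, risky_terms: list[str]) -> list[str]:
    """Naive placeholder: flag sentences containing numbers or terms that
    often signal claims needing verification (causal language, etc.)."""
    sentences = [s.strip() for s in draft.split(".") if s.strip()]
    return [
        s for s in sentences
        if any(term in s.lower() for term in risky_terms)
        or any(ch.isdigit() for ch in s)
    ]


def ready_to_publish(review: DraftReview) -> bool:
    """Publication gate: every flagged claim must be resolved and a human
    editor must sign off before the brief goes out."""
    return review.human_approved and not review.flagged_claims


if __name__ == "__main__":
    llm_draft = (
        "The study followed 64 papers over 12 months. "
        "Results prove the treatment causes a 40% improvement."
    )
    review = DraftReview(draft=llm_draft)
    review.flagged_claims = flag_claims_for_checking(
        llm_draft, risky_terms=["prove", "causes", "significant"]
    )
    print("Claims needing human fact-checking:")
    for claim in review.flagged_claims:
        print(" -", claim)
    print("Ready to publish:", ready_to_publish(review))  # False until cleared
```

The point of the gate is the order of operations the study implies: the model drafts, automated checks narrow the reviewer's attention, and a human still owns the final accuracy call.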