🤖 AI Summary
Researchers built CPC-Bench, a physician-validated benchmark drawn from 7,102 New England Journal of Medicine Clinicopathological Conferences (1923–2025) plus 1,021 Image Challenges (2006–2025), covering 10 text-based and multimodal tasks that capture not only the final diagnosis but also the layered reasoning, test selection, imaging, and presentation skills expected of expert discussants. Using extensive physician annotation and automated processing, they evaluated leading LLMs and trained "Dr. CaBot," an AI discussant that generates written differentials and slide-based video presentations from case presentations. On 377 contemporary CPCs, OpenAI's o3 ranked the correct final diagnosis first in 60% of cases and within the top ten in 84%, outperforming a 20-physician baseline; next-test selection accuracy reached 98%. Image and literature tasks lagged: o3 and Google's Gemini 2.5 Pro reached roughly 67% on the Image Challenges, and performance dropped on literature-retrieval tasks.
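The headline figures are top-k accuracies over the 377 contemporary CPCs: a case counts as correct at rank k if the true diagnosis appears among the model's first k differential entries. The sketch below illustrates that scoring logic only; the function name, exact-string matching, and toy data are assumptions for illustration, whereas the benchmark itself adjudicates diagnosis matches with physician and automated review.

```python
from typing import List

def top_k_accuracy(ranked_differentials: List[List[str]],
                   true_diagnoses: List[str],
                   k: int) -> float:
    """Fraction of cases whose true diagnosis appears in the model's top-k
    ranked differential. Exact lowercase string matching is a simplification
    of the benchmark's adjudicated matching."""
    hits = 0
    for ranked, truth in zip(ranked_differentials, true_diagnoses):
        top_k = [d.strip().lower() for d in ranked[:k]]
        if truth.strip().lower() in top_k:
            hits += 1
    return hits / len(true_diagnoses)

# Toy example with two hypothetical cases (not benchmark data):
differentials = [
    ["sarcoidosis", "tuberculosis", "lymphoma"],
    ["lupus nephritis", "IgA nephropathy"],
]
truths = ["lymphoma", "minimal change disease"]
print(top_k_accuracy(differentials, truths, k=1))  # 0.0  (top-1 miss on both)
print(top_k_accuracy(differentials, truths, k=3))  # 0.5  (lymphoma found in top 3)
```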
The study shows that LLMs can exceed clinician performance on complex text-based differential diagnosis and convincingly emulate expert presentations: in blinded trials, physicians could not reliably distinguish CaBot from human discussants and often rated CaBot higher. But gaps in image interpretation and literature search remain obstacles to clinical deployment. By releasing CPC-Bench and CaBot, the authors provide a reproducible, longitudinal benchmark for tracking progress in medical AI and for focusing research on multimodal grounding and reliable evidence retrieval before real-world use.