🤖 AI Summary
Researchers at Mass General Brigham have unveiled BRIDGE, a multilingual benchmark designed to assess how well large language models (LLMs) understand clinical patient-care texts, including language from electronic health records (EHRs). The benchmark evaluates LLMs across nine languages and aims to bridge the gap between AI performance on standardized medical exams and on real-world clinical tasks. The study, published in Nature Biomedical Engineering, highlights BRIDGE's focus on actual clinical data, which provides a more nuanced evaluation of AI tools that could enhance patient care across diverse medical specialties.
The findings reveal a stark performance gap: the highest-scoring LLM achieved just 44.8% on BRIDGE, compared with 92% on traditional licensing exams. This discrepancy underscores the inadequacy of existing benchmarks, which often rely on standardized medical language and so fail to capture the complexities of clinical interactions. By systematically evaluating 95 LLMs across various tasks, such as diagnosis and billing coding, BRIDGE aims to help clinicians select appropriate AI tools while guiding developers to improve model accuracy and equity, particularly for non-English-speaking patients. A dynamic leaderboard further enables ongoing comparisons, making BRIDGE a notable advance for the AI/ML community in healthcare.