[Paper published in JAMA Netw Open]: Large Language Model Performance and Clinical Reasoning Tasks
July 5, 2026 | News Briefs, Progress Exchange

Original Investigation 

Health Informatics

Large Language Model Performance and Clinical Reasoning Tasks

Arya S. Rao, Kaiz P. Esmail, Richard S. Lee, et al

JAMA Netw Open. 2026;9(4):e264003. doi:10.1001/jamanetworkopen.2026.4003

Key Points

Question  Can off-the-shelf large language models (LLMs) demonstrate reliable performance across the clinical workflow?

Findings  In this cross-sectional study of 21 frontier LLMs tested on 29 standardized clinical vignettes, Grok 4 and other reasoning-optimized models achieved the highest scores, while Gemini 1.5 Flash performed lowest. Differential diagnosis consistently showed the weakest performance, while final diagnosis and management had stronger performances.

Meaning  These findings suggest that despite progress, current LLMs remain limited in early diagnostic reasoning and cannot yet be relied on for unsupervised patient-facing clinical decision-making.

Abstract

Importance  Large language models (LLMs) are increasingly marketed for clinical use, yet their ability to replicate full-spectrum clinical reasoning remains uncertain. Existing evaluations often rely on multiple-choice examinations that do not reflect the complexity of patient care.

Objectives  To evaluate the longitudinal clinical reasoning ability of state-of-the-art LLMs and to introduce a multidimensional, clinically meaningful benchmark for clinical-grade artificial intelligence (AI).

Design, Setting, and Participants  In this cross-sectional study, performance was evaluated using standardized clinical vignettes from the January 2025 update of MSD Manual vignettes. A total of 21 off-the-shelf LLMs, including recently released GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash and Pro, and Grok 4, were evaluated. Models were assessed by medical student scorers in triplicate across sequential stages of the standard clinical workflow. Analyses were performed from January to December 2025.

Main Outcomes and Measures  The primary outcome was the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM) score, defined as the normalized polygonal area representing balanced accuracy across 5 domains of clinical reasoning as follows: differential diagnosis, diagnostic testing, final diagnosis, management, and miscellaneous clinical reasoning questions. Analyses including analyses of variance, t tests, and regression models were used to compare AI model performance and demographic associations.
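
As a rough illustration of what a "normalized polygonal area" across 5 domains could look like, the sketch below plots each domain accuracy as a radius on an equally spaced radar chart and divides the resulting polygon's area by the area obtained when every domain scores 1.0. The abstract does not specify the axis order, the accuracy definition, or the exact normalization used for PrIME-LLM, so all of those choices, the function name `normalized_polygon_area`, and the sample numbers are assumptions made only for this example.

```python
import math


def normalized_polygon_area(scores):
    """Radar-polygon area for `scores`, divided by the area when all scores are 1.0.

    Assumes equally spaced axes and scores in [0, 1]; this is an illustrative
    construction, not the published PrIME-LLM definition.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("need at least 3 axes to form a polygon")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    # Each pair of adjacent axes spans a triangle of area 0.5 * r1 * r2 * sin(angle).
    wedge = 0.5 * math.sin(2 * math.pi / n)
    area = wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))
    max_area = wedge * n  # every radius equal to 1.0
    return area / max_area


# Hypothetical per-domain accuracies (NOT values from the study),
# listed in the order the abstract names the domains.
domain_accuracy = {
    "differential diagnosis": 0.20,
    "diagnostic testing": 0.70,
    "final diagnosis": 0.80,
    "management": 0.80,
    "miscellaneous": 0.75,
}

if __name__ == "__main__":
    values = list(domain_accuracy.values())
    print("mean accuracy:", sum(values) / len(values))
    print("normalized polygon area:", normalized_polygon_area(values))
```

Under these assumptions, a model with uniform accuracy a across all domains scores a squared, and one weak domain drags down both triangles it borders, so an area-based index penalizes imbalance more than a simple mean does. The result also depends on the order in which the domains are placed around the chart.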

Results  LLMs were tested across 29 clinical vignettes (representing 16 254 responses in total). PrIME-LLM scores ranged from 0.64 (range, 0.63-0.65) (Gemini 1.5 Flash) to 0.78 (range, 0.77-0.79) (Grok 4), with reasoning-optimized models outperforming nonreasoning models and GPT models scoring highest overall. Differential diagnosis was less accurate than diagnostic testing, while final diagnosis, management, and miscellaneous reasoning were more accurate. Failure rates exceeded 0.80 (range, 0.90-1.00) for differential diagnosis in all models but were less than 0.40 (range, 0.09-0.39) for final diagnosis. Multimodal performance was robust; most LLMs showed improved accuracy with image inputs.

Conclusions and Relevance  In this cross-sectional study of 21 LLMs, frontier LLMs achieved high accuracy on final diagnoses but performed poorly in generating differential diagnoses and navigating uncertainty relative to other reasoning stages. The PrIME-LLM framework provided greater separation than raw accuracy, revealing critical reasoning gaps obscured by traditional benchmarks. Thus, despite version-based improvements and advantages in reasoning-optimized models, off-the-shelf LLMs have not yet achieved the intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning.
