Large Language Model Performance and Clinical Reasoning Tasks

Arya S. Rao, Kaiz P. Esmail, Richard S. Lee, et al

JAMA Netw Open 2026;9(4):e264003. doi:10.1001/jamanetworkopen.2026.4003

Key Points

Question Can off-the-shelf large language models (LLMs) demonstrate reliable performance across the clinical workflow?

Findings In this cross-sectional study of 21 frontier LLMs tested on 29 standardized clinical vignettes, Grok 4 and other reasoning-optimized models achieved the highest scores, while Gemini 1.5 Flash scored lowest. Differential diagnosis was consistently the weakest domain, whereas final diagnosis and management were stronger.

Meaning These findings suggest that despite progress, current LLMs remain limited in early diagnostic reasoning and cannot yet be relied on for unsupervised patient-facing clinical decision-making.

Abstract

Importance Large language models (LLMs) are increasingly marketed for clinical use, yet their ability to replicate full-spectrum clinical reasoning remains uncertain. Existing evaluations often rely on multiple-choice examinations that do not reflect the complexity of patient care.

Objectives To evaluate the longitudinal clinical reasoning ability of state-of-the-art LLMs and to introduce a multidimensional, clinically meaningful benchmark for clinical-grade artificial intelligence (AI).

Design, Setting, and Participants In this cross-sectional study, performance was evaluated using standardized clinical vignettes from the January 2025 update of the MSD Manual. A total of 21 off-the-shelf LLMs, including the recently released GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash and Pro, and Grok 4, were evaluated. Models were assessed by medical student scorers in triplicate across sequential stages of the standard clinical workflow. Analyses were performed from January to December 2025.

Main Outcomes and Measures The primary outcome was the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM) score, defined as the normalized polygonal area representing balanced accuracy across 5 domains of clinical reasoning: differential diagnosis, diagnostic testing, final diagnosis, management, and miscellaneous clinical reasoning questions. Analyses of variance, t tests, and regression models were used to compare AI model performance and to assess demographic associations.
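The abstract does not give the exact PrIME-LLM formula, so the following is only a minimal sketch of one plausible reading of "normalized polygonal area": the five domain accuracies are plotted on equally spaced radar-chart axes, and the area of the resulting polygon is divided by the area obtained when every accuracy equals 1. The function name, the axis order, and the example accuracies are all hypothetical, and the authors' actual definition may differ.

```python
import math

# Assumed axis order (hypothetical; the abstract does not specify one).
DOMAINS = [
    "differential diagnosis",
    "diagnostic testing",
    "final diagnosis",
    "management",
    "miscellaneous",
]


def normalized_polygon_area(scores):
    """Radar-chart polygon area for `scores`, divided by the area of the
    polygon obtained when every score is 1.0.

    Assumes equally spaced axes and scores in [0, 1]. This is an
    illustrative reading, not the published PrIME-LLM definition.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("need at least 3 axes to form a polygon")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    # Each pair of adjacent axes spans a triangle with area
    # 0.5 * r_i * r_next * sin(2*pi/n).
    wedge = 0.5 * math.sin(2 * math.pi / n)
    area = wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))
    max_area = wedge * n  # every score equal to 1.0
    return area / max_area


if __name__ == "__main__":
    # Hypothetical accuracies, listed in the order of DOMAINS.
    example = [0.2, 0.8, 0.9, 0.8, 0.8]
    for name, value in zip(DOMAINS, example):
        print(f"{name}: {value:.2f}")
    print(f"normalized area: {normalized_polygon_area(example):.2f}")
```

Under these assumptions the score reduces to the mean of the products of adjacent axes, so it scales quadratically (a uniform accuracy of 0.5 yields 0.25) and depends on the order in which the domains are placed around the chart.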

Results LLMs were tested across 29 clinical vignettes (representing 16 254 responses in total). PrIME-LLM scores ranged from 0.64 (range, 0.63-0.65) (Gemini 1.5 Flash) to 0.78 (range, 0.77-0.79) (Grok 4), with reasoning-optimized models outperforming nonreasoning models and GPT models scoring highest overall. Differential diagnosis was less accurate than diagnostic testing, while final diagnosis, management, and miscellaneous reasoning were more accurate. Failure rates exceeded 0.80 (range, 0.90-1.00) for differential diagnosis in all models but were less than 0.40 (range, 0.09-0.39) for final diagnosis. Multimodal performance was robust; most LLMs showed improved accuracy with image inputs.

Conclusions and Relevance In this cross-sectional study of 21 LLMs, frontier LLMs achieved high accuracy on final diagnoses but performed poorly in generating differential diagnoses and navigating uncertainty relative to other reasoning stages. The PrIME-LLM framework provided greater separation than raw accuracy, revealing critical reasoning gaps obscured by traditional benchmarks. Thus, despite version-based improvements and advantages in reasoning-optimized models, off-the-shelf LLMs have not yet achieved the intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning.

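Posted by dubin98 under the categories 时讯速递 (News Express) and 进展交流 (Progress Exchange). When reposting, please credit: [Paper published in JAMA Netw Open]: Large Language Model Performance and Clinical Reasoning Tasks | Critical Care Medicine Committee of the Chinese Association of Pathophysiology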
