[ICU Management & Practice]: Large Language Models in Clinical Reasoning
20 February 2026

Large Language Models in Clinical Reasoning

  • In ICU
  • Thu, 2 Oct 2025

Large language models (LLMs) are increasingly used in clinical care for documentation, communication, and decision support, but their reasoning abilities remain difficult to evaluate. Traditional benchmarks, though useful, often measure knowledge recall rather than true clinical reasoning.

Script Concordance Testing (SCT) offers a scalable way to assess a core skill: adjusting diagnostic or management judgments under uncertainty. SCT compares model responses with those of expert clinicians, capturing legitimate variation in reasoning. 

Researchers have released the first open SCT benchmark to test how well LLMs integrate new clinical information, compare with human practitioners, and vary across architectures. While SCT is a proxy that cannot fully replicate real-world reasoning, it highlights performance gaps and provides a bridge between conventional benchmarks and clinical deployment.

A public benchmark of 750 SCT questions was created from 10 international, multispecialty datasets (9 newly released). Each vignette asks how new information alters a diagnosis or management option, with answers scored against expert panels. Performance was compared across 10 leading LLMs and 1,563 human participants, including 1,070 medical students, 193 residents, and 300 attending physicians.
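The "scored against expert panels" step typically follows the standard SCT aggregate scoring scheme, in which a response earns credit proportional to how many panellists chose it, normalised by the count of the modal panel response. The sketch below illustrates that conventional scheme with made-up panel data; the study's exact scoring implementation may differ.

```python
# Standard SCT "aggregate" scoring (illustrative sketch; the study's exact
# scheme may differ). Each vignette asks, on a Likert scale (e.g. -2..+2),
# how new information changes a diagnostic or management hypothesis.
from collections import Counter

def sct_item_score(response: int, panel_responses: list[int]) -> float:
    """Credit for one response: panel votes for that option divided by
    the vote count of the modal (most frequently chosen) option."""
    counts = Counter(panel_responses)
    modal_count = max(counts.values())
    return counts.get(response, 0) / modal_count

# Hypothetical panel of 10 experts rating one vignette on a -2..+2 scale.
panel = [+1, +1, +1, +1, +1, 0, 0, +2, +2, -1]
print(sct_item_score(+1, panel))  # modal answer -> full credit: 1.0
print(sct_item_score(0, panel))   # minority answer -> partial credit: 0.4
print(sct_item_score(-2, panel))  # answer no panellist chose -> 0.0
```

Because minority expert answers earn partial credit, this scoring captures the "legitimate variation in reasoning" mentioned above rather than forcing a single correct answer.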

LLMs scored substantially lower on SCT than on standard medical multiple-choice benchmarks. OpenAI’s o3 led with 67.8%, followed by GPT-4o at 63.9%, while other reasoning-optimised models, including o1-preview (58.2%) and DeepSeek R1 (55.5%), lagged, and Gemini 2.5 performed worst (52.1%). Although models often matched or exceeded medical students, they fell short of residents and attending physicians. 

Analysis revealed systematic overconfidence: reasoning-focused models overused extreme ratings and rarely selected neutral options, suggesting that chain-of-thought tuning may impair adaptability under clinical uncertainty.

Overall, this study shows that SCT provides a distinctive, challenging benchmark for assessing LLM clinical reasoning. Unlike multiple-choice tests, SCT highlights major gaps between models and expert clinicians, even for top-performing systems like OpenAI’s o3. While models often match medical students, they fall short of residents and attendings, reflecting SCT’s focus on flexible judgment under uncertainty rather than factual recall.

Findings reveal systematic overconfidence in reasoning-tuned models, which overuse extreme responses and under-recognise when information should not change a hypothesis. This mirrors previous evidence that chain-of-thought prompting can impair performance on tasks requiring subtle, probabilistic updating. The benchmark’s global diversity and inclusion of multiple training levels strengthen its validity, though limitations include prompt sensitivity, scoring artefacts, and the context-specific nature of clinical reasoning.
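The overconfidence pattern described above can be quantified by comparing how often a respondent picks extreme versus neutral options on a 5-point SCT scale. A minimal sketch, using purely illustrative rating data (not figures from the study):

```python
# Illustrative sketch: measuring the reported tendency of reasoning-tuned
# models to overuse extreme ratings (-2/+2) and rarely pick neutral (0).
# The rating lists below are invented for demonstration only.
def response_profile(ratings: list[int]) -> dict[str, float]:
    """Fraction of responses that are extreme (-2/+2) or neutral (0)."""
    n = len(ratings)
    return {
        "extreme": sum(r in (-2, +2) for r in ratings) / n,
        "neutral": sum(r == 0 for r in ratings) / n,
    }

model_ratings = [+2, +2, -2, +2, +1, -2, +2, -2, +2, +1]   # invented
expert_ratings = [+1, 0, -1, +1, 0, -1, +2, 0, +1, -2]     # invented

print(response_profile(model_ratings))   # {'extreme': 0.8, 'neutral': 0.0}
print(response_profile(expert_ratings))  # {'extreme': 0.2, 'neutral': 0.3}
```

A low neutral rate is the telltale sign: it means the respondent almost never concludes that new information should leave a hypothesis unchanged.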

SCT is a valuable tool: its concordance-based scoring captures the probabilistic, context-sensitive reasoning critical for clinical decision-making and helps identify where LLMs diverge from expert performance. The publicly available benchmark provides a foundation for improving reasoning in AI and a step toward richer, trial-based evaluations of models in real-world care, while underscoring the difficulty of evaluating AI as it approaches human-level performance.

Source: NEJM AI
Image Credit: Brian Penny from Pixabay

References:

McCoy LG, Swamy R, Sagar N et al. (2025) Assessment of Large Language Models in Clinical Reasoning: A Novel Benchmarking Study. NEJM AI. 2(10).
