{"id":29263,"date":"2026-02-27T04:03:00","date_gmt":"2026-02-26T20:03:00","guid":{"rendered":"https:\/\/csccm.org.cn\/?p=29263"},"modified":"2026-02-27T05:53:10","modified_gmt":"2026-02-26T21:53:10","slug":"nejm-ai%e5%8f%91%e8%a1%a8%e8%ae%ba%e6%96%87%ef%bc%9a%e5%a4%a7%e8%af%ad%e8%a8%80%e6%a8%a1%e5%9e%8b%e5%9c%a8%e4%b8%b4%e5%ba%8a%e6%8e%a8%e7%90%86%e8%bf%87%e7%a8%8b%e4%b8%ad%e7%9a%84%e8%af%84%e4%bb%b7","status":"publish","type":"post","link":"https:\/\/csccm.org.cn\/?p=29263","title":{"rendered":"[Paper Published in NEJM AI]: Evaluation of Large Language Models in the Clinical Reasoning Process"},"content":{"rendered":"\n<p><a href=\"https:\/\/ai.nejm.org\/browse\/ai-article-type\/datasets-benchmarks-protocols\">DATASETS, BENCHMARKS, AND PROTOCOLS<\/a><\/p>\n\n\n\n<h1 class=\"wp-block-heading\">Assessment of Large Language Models in Clinical Reasoning: A Novel Benchmarking Study<\/h1>\n\n\n\n<h3 class=\"wp-block-heading\">Liam G.&nbsp;McCoy,&nbsp;Rajiv&nbsp;Swamy,&nbsp;Nidhish&nbsp;Sagar,&nbsp;et al.<\/h3>\n\n\n\n<h3 class=\"wp-block-heading\">NEJM AI&nbsp;2025;2(10) Published&nbsp;September 25, 2025<\/h3>\n\n\n\n<h3 class=\"wp-block-heading\">DOI: 10.1056\/AIdbp2500120<\/h3>\n\n\n\n<h2 class=\"wp-block-heading\">Abstract<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">BACKGROUND<\/h3>\n\n\n\n<p>Large language models (LLMs) are increasingly deployed for clinical decision support, yet standard evaluations such as medical licensing exams overlook how clinicians update decisions in dynamic contexts. Script concordance testing (SCT), a decades-old assessment tool, measures how new information adjusts diagnostic or therapeutic judgments under uncertainty.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">METHODS<\/h3>\n\n\n\n<p>We built a public benchmark of 750 SCT questions drawn from 10 international, diverse datasets \u2014 9 of which were newly released \u2014 spanning multiple specialties. 
Each item presents a clinical vignette and asks how added data change the likelihood of a diagnosis or management option, scored against expert-panel responses. Ten state-of-the-art LLMs were compared with 1070 medical students (1026 medical; 44 physiotherapy), 193 residents, and 300 attending physicians.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">RESULTS<\/h3>\n\n\n\n<p>LLMs demonstrated markedly lower performance on SCT than their typical achievement on medical multiple-choice benchmarks. Across prompting conditions, OpenAI\u2019s o3 achieved the highest performance (67.8%\u00b11.2%), followed by GPT-4o (63.9%\u00b11.3%). Other reasoning-optimized models, including OpenAI\u2019s o1-preview (58.2%\u00b11.3%) and DeepSeek R1 (55.5%\u00b11.4%), performed significantly lower, with Google\u2019s Gemini 2.5 Pro Preview (Gemini 2.5) (52.1%\u00b11.4%) exhibiting the poorest results overall, while other non-reasoning models showed middling performance. Models matched or exceeded student performance on multiple examinations but did not reach the level of senior residents or attending physicians. 
Response-pattern analysis showed systematic overconfidence, with reasoning-tuned models overusing extreme ratings (+2\/\u22122) and seldom choosing 0, implying that chain-of-thought optimizations may hinder flexible clinical reasoning under uncertainty.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f1.jpg\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"995\" src=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f1-1024x995.jpg\" alt=\"\" class=\"wp-image-30120\" srcset=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f1-1024x995.jpg 1024w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f1-300x291.jpg 300w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f1-768x746.jpg 768w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f1-1536x1492.jpg 1536w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f1.jpg 2009w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f2.jpg\"><img decoding=\"async\" loading=\"lazy\" width=\"983\" height=\"1024\" src=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f2-983x1024.jpg\" alt=\"\" class=\"wp-image-30121\" srcset=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f2-983x1024.jpg 983w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f2-288x300.jpg 288w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f2-768x800.jpg 768w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f2-1474x1536.jpg 1474w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f2-1966x2048.jpg 1966w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f2.jpg 2012w\" 
sizes=\"(max-width: 983px) 100vw, 983px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f3.jpg\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"716\" src=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f3-1024x716.jpg\" alt=\"\" class=\"wp-image-30122\" srcset=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f3-1024x716.jpg 1024w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f3-300x210.jpg 300w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f3-768x537.jpg 768w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f3-1536x1074.jpg 1536w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f3.jpg 1800w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f4.jpg\"><img decoding=\"async\" loading=\"lazy\" width=\"1024\" height=\"670\" src=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f4-1024x670.jpg\" alt=\"\" class=\"wp-image-30123\" srcset=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f4-1024x670.jpg 1024w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f4-300x196.jpg 300w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f4-768x503.jpg 768w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f4-1536x1005.jpg 1536w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f4-2048x1340.jpg 2048w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_f4-236x155.jpg 236w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/a><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><a 
href=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_t1.jpg\"><img decoding=\"async\" loading=\"lazy\" width=\"975\" height=\"1024\" src=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_t1-975x1024.jpg\" alt=\"\" class=\"wp-image-30124\" srcset=\"https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_t1-975x1024.jpg 975w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_t1-286x300.jpg 286w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_t1-768x806.jpg 768w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_t1-1463x1536.jpg 1463w, https:\/\/csccm.org.cn\/wp-content\/uploads\/2026\/02\/aidbp2500120_t1-1951x2048.jpg 1951w\" sizes=\"(max-width: 975px) 100vw, 975px\" \/><\/a><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">CONCLUSIONS<\/h3>\n\n\n\n<p>SCT exposes persistent limitations in LLM clinical reasoning, especially in models optimized for explicit reasoning. Although SCT performance offers analogies to human probabilistic adjustment, it represents only one facet of evaluating artificial intelligence decision support. Our human-validated benchmark, now publicly available, provides a rigorous tool for advancing the assessment of medical artificial intelligence systems. 
(Funded by the Fulbright Program and others.)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>DATASETS, BENCHMARKS, AND PROTOCOLS Assessment of Large [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[32,23],"tags":[],"_links":{"self":[{"href":"https:\/\/csccm.org.cn\/index.php?rest_route=\/wp\/v2\/posts\/29263"}],"collection":[{"href":"https:\/\/csccm.org.cn\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/csccm.org.cn\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/csccm.org.cn\/index.php?rest_route=\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/csccm.org.cn\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=29263"}],"version-history":[{"count":2,"href":"https:\/\/csccm.org.cn\/index.php?rest_route=\/wp\/v2\/posts\/29263\/revisions"}],"predecessor-version":[{"id":30125,"href":"https:\/\/csccm.org.cn\/index.php?rest_route=\/wp\/v2\/posts\/29263\/revisions\/30125"}],"wp:attachment":[{"href":"https:\/\/csccm.org.cn\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=29263"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/csccm.org.cn\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=29263"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/csccm.org.cn\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=29263"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}