Performance of Large Language Models on Medical Oncology Examination Questions

Jack B. Longwell, Ian Hirsch, Fernando Binder, et al

JAMA Netw Open. 2024;7(6):e2417641. doi:10.1001/jamanetworkopen.2024.17641

Key Points

Question What medical oncology knowledge is encoded by large language models (LLMs)?

Findings In this cross-sectional study evaluating 8 LLMs, proprietary LLM 2 correctly answered 85.0% of examination-style multiple-choice questions from the American Society of Clinical Oncology, the European Society for Medical Oncology, and an original set from the authors, outperforming proprietary LLM 1 and open-source models. However, 81.8% of incorrect answers were rated as having a medium or high likelihood of moderate to severe harm if acted upon in practice.

Meaning These findings suggest that LLMs can accurately answer questions requiring advanced knowledge of medical oncology, although errors may cause harm.

Abstract

Importance Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information.

Objective To evaluate the accuracy and safety of LLM answers on medical oncology examination questions.

Design, Setting, and Participants This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society for Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs.

Main Outcomes and Measures The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm.

Results Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm.
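The proportions and intervals in the Results can be checked with a few lines of arithmetic. The sketch below recomputes each percentage from the reported counts and derives an exact (Clopper-Pearson) 95% binomial interval by bisection. The abstract does not say which interval method the authors used, so Clopper-Pearson is an assumption here, and the function names (`binom_cdf`, `clopper_pearson`) are illustrative. Recomputed bounds need not match the published ones to the last digit.

```python
from math import comb


def binom_cdf(k, n, p):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials.

    Assumption: the source does not name its interval method; this is one
    common choice for binomial proportions.
    """
    target = alpha / 2

    def bisect(go_right):
        # go_right(p) is True while the root lies to the right of p
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if go_right(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which P(X >= k) rises to alpha/2
    lower = 0.0 if k == 0 else bisect(lambda p: 1 - binom_cdf(k - 1, n, p) < target)
    # Upper bound: the p at which P(X <= k) falls to alpha/2
    upper = 1.0 if k == n else bisect(lambda p: binom_cdf(k, n, p) > target)
    return lower, upper


# (count, total) pairs exactly as reported in the Results paragraph
reported = {
    "Proprietary LLM 2, correct answers": (125, 147),
    "Proprietary LLM 1, correct answers": (89, 147),
    "Mixtral-8x7B-v0.1, correct answers": (87, 147),
    "Proprietary LLM 2, explanations with no or minor errors": (138, 147),
    "Incorrect answers with medium or high likelihood of harm": (18, 22),
}

for label, (k, n) in reported.items():
    lo, hi = clopper_pearson(k, n)
    print(f"{label}: {k}/{n} = {100 * k / n:.1f}% "
          f"(95% CI, {100 * lo:.1f}%-{100 * hi:.1f}%)")
```

Note that the harm estimate rests on only 22 incorrect answers, which is why its interval is much wider than those computed over all 147 questions.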

Conclusions and Relevance In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.

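Posted by: dubin98

This post was published by dubin98 2 years ago under the News Express and Progress Exchange categories.
When reposting, please credit: [Paper published in JAMA Netw Open]: Performance of large language models on oncology examination questions | Critical Care Medicine Committee, Chinese Association of Pathophysiology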
