[Paper published in JAMA Intern Med]: Clinical Reasoning of a Generative Artificial Intelligence Model Compared With Physicians
June 17, 2024

Research Letter 

April 1, 2024

Clinical Reasoning of a Generative Artificial Intelligence Model Compared With Physicians

Stephanie Cabral, Daniel Restrepo, Zahir Kanjee, et al

JAMA Intern Med. Published online April 1, 2024. doi:10.1001/jamainternmed.2024.0295

Large language models (LLMs) have shown promise in clinical reasoning, but their ability to synthesize clinical encounter data into problem representations remains unexplored.1-3 We compared an LLM’s reasoning abilities against human performance using standards developed for physicians.

Methods

We recruited internal medicine residents and attending physicians at 2 academic medical centers in Boston, Massachusetts, from July to August 2023. We used 20 clinical cases, each comprising 4 sections representing sequential stages of clinical data acquisition.4 We developed a survey instructing physicians to write a problem representation and prioritized differential diagnosis with justifications for each section (eTable 1 in Supplement 1). Each physician received the survey with 1 randomly selected case (4 sections). We developed a prompt with identical instructions (eTable 2 in Supplement 1) and ran all sections in GPT-4 (OpenAI) on August 17-18, 2023.5 The Massachusetts General Hospital and Beth Israel Deaconess Medical Center Institutional Review Boards deemed this cross-sectional study exempt from review. Written informed consent was obtained. We followed the STROBE reporting guideline.
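For context on how such a zero-shot prompt might be submitted programmatically, the sketch below sends one case section to a GPT-4 model through the OpenAI Python client (v1.x assumed). The system instructions and case text are placeholders for illustration, not the study's actual prompt from eTable 2 in Supplement 1.

```python
# Minimal sketch of a zero-shot prompt to a GPT-4 model via the OpenAI
# Python client (v1.x assumed). The system instructions and case text are
# placeholders, not the study's actual prompt from eTable 2 in Supplement 1.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

case_section = "A 54-year-old man presents with 2 days of fever and cough ..."  # placeholder

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,  # assumption: low temperature for more reproducible output
    messages=[
        {
            "role": "system",
            "content": (
                "Write a problem representation and a prioritized differential "
                "diagnosis with justifications for the case section provided."
            ),
        },
        {"role": "user", "content": case_section},
    ],
)

print(response.choices[0].message.content)
```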

The primary outcome was the Revised-IDEA (R-IDEA) score, a validated 10-point scale evaluating 4 core domains of clinical reasoning documentation (eTable 3 in Supplement 1).6 To establish reliability, we (D.R., Z.K., A.R.) independently scored 29 section responses from 8 nonparticipants, showing substantial scoring agreement (mean Cohen weighted κ = 0.61). Secondary outcomes included the presence of correct and incorrect reasoning, diagnostic accuracy, and cannot-miss diagnoses (eMethods in Supplement 1). Scorers were blinded to respondent type (chatbot, attending, resident).
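To illustrate the reliability step, the sketch below computes a mean pairwise weighted Cohen κ across three raters on fabricated ordinal R-IDEA scores using scikit-learn; the linear weighting and the simulated scores are assumptions, not the authors' exact procedure.

```python
# Sketch of inter-rater reliability for ordinal R-IDEA scores (0-10) using
# weighted Cohen's kappa. The scores are simulated and the linear weighting
# is an assumption, not necessarily the authors' exact procedure.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
true_quality = rng.integers(0, 11, size=29)  # 29 section responses, as in the study

# Three hypothetical raters who each deviate slightly from the "true" score
ratings = {
    f"rater_{i}": np.clip(true_quality + rng.integers(-1, 2, size=29), 0, 10)
    for i in range(3)
}

# Mean pairwise linearly weighted kappa across all rater pairs
kappas = [
    cohen_kappa_score(ratings[a], ratings[b], labels=list(range(11)), weights="linear")
    for a, b in combinations(ratings, 2)
]
print(f"mean weighted kappa = {np.mean(kappas):.2f}")
```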

Descriptive statistics were calculated for all outcomes. R-IDEA scores were binarized as low (0-7) or high (8-10). Associations between respondent type and score category were evaluated using logistic regression with random effects accounting for participant repeat structure (eMethods in Supplement 1). Significance was defined as 2-sided P < .05. Analyses were performed using R 4.2.1 (R Core Team).
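A simplified sketch of the binarization and regression step follows, using Python's statsmodels on fabricated scores; it fits a plain logistic regression and omits the random-effects structure for repeated responses per participant used in the actual analysis.

```python
# Simplified sketch: binarize R-IDEA scores as low (0-7) vs high (8-10) and
# model the probability of a high score by respondent type. The actual
# analysis used logistic regression with random effects for repeated
# responses per participant; this plain logistic regression omits that
# structure, and the scores below are fabricated.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(
    {
        "respondent": ["chatbot"] * 6 + ["attending"] * 6 + ["resident"] * 6,
        "r_idea": [10, 9, 10, 8, 7, 10,   # chatbot (fabricated)
                   9, 6, 10, 8, 5, 9,     # attendings (fabricated)
                   8, 4, 9, 6, 3, 8],     # residents (fabricated)
    }
)
df["high_score"] = (df["r_idea"] >= 8).astype(int)

model = smf.logit(
    "high_score ~ C(respondent, Treatment(reference='resident'))", data=df
).fit()
print(model.summary())

# Estimated probability of a high R-IDEA score for each respondent type
probs = model.predict(pd.DataFrame({"respondent": ["chatbot", "attending", "resident"]}))
print(probs)
```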

Results

The sample included 21 attendings and 18 residents, who each provided responses to a single case. Chatbot provided responses to all 20 cases. Two hundred thirty-six sections were completed, with 232 unique combinations of respondent type and section.

Median (IQR) R-IDEA scores were 10 (9-10) for chatbot, 9 (6-10) for attendings, and 8 (4-9) for residents (Table). In logistic regression analysis, chatbot had the highest estimated probability of achieving high R-IDEA scores (0.99; 95% CI, 0.98-1.00), followed by attendings (0.76; 95% CI, 0.51-1.00) and residents (0.56; 95% CI, 0.23-0.90), with chatbot being significantly higher than attendings (P = .002) and residents (P < .001) (Figure). Using the Wilcoxon signed-rank test, chatbot had significantly higher R-IDEA scores than attendings (154; P = .003) and residents (127; P = .002). Attendings’ scores did not differ from residents’.
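As an illustration of this kind of nonparametric comparison, the sketch below applies a Wilcoxon signed-rank test to fabricated R-IDEA scores, assuming chatbot and attending responses can be paired by case; the values are not the study data.

```python
# Sketch of a paired nonparametric comparison of R-IDEA scores, assuming
# chatbot and attending responses are matched by case. All values are
# fabricated for illustration.
from scipy.stats import wilcoxon

chatbot_scores = [10, 9, 10, 10, 8, 9, 10, 7, 10, 9]
attending_scores = [9, 6, 8, 8, 5, 7, 7, 6, 9, 8]

stat, p_value = wilcoxon(chatbot_scores, attending_scores)
print(f"Wilcoxon statistic = {stat}, p = {p_value:.3f}")
```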

Chatbot performed similarly to attendings and residents in diagnostic accuracy, correct clinical reasoning, and cannot-miss diagnosis inclusion. Median (IQR) inclusion rates of cannot-miss diagnoses in initial differentials were 66.7% (50.0%-100%) for chatbot, 50.0% (27.1%-100%) for attendings, and 66.7% (33.3%-81.2%) for residents (Table). Chatbot had more frequent instances of incorrect clinical reasoning (13.8%) than residents (2.8%; P = .04) but not attendings (12.5%; P = .89).

Discussion

An LLM performed better than physicians at processing medical data and demonstrating clinical reasoning with recognizable frameworks, as measured by R-IDEA scores. Several other clinical reasoning outcomes showed no differences between physicians and chatbot, although chatbot had more instances of incorrect clinical reasoning than residents. This observation underscores the importance of multifaceted evaluations of LLM capabilities before their integration into the clinical workflow.

Study limitations include clinical data being provided in simulated cases; it is unclear how chatbot would perform in clinical scenarios. Physicians’ vignette responses may differ from clinical practice; however, physician R-IDEA scores in clinical documentation were lower than physician scores in this study.6 We also used a zero-shot approach for chatbot’s prompt. Iterative training could enhance LLM performance, suggesting that the results may have underestimated its capabilities. Future research should assess clinical reasoning of the LLM-physician interaction, as LLMs will more likely augment, not replace, the human reasoning process.

