Original Investigation
Emergency Medicine
May 7, 2024
Use of a Large Language Model to Assess Clinical Acuity of Adults in the Emergency Department
Christopher Y. K. Williams, Travis Zack, Brenda Y. Miao, et al
JAMA Netw Open. 2024;7(5):e248895. doi:10.1001/jamanetworkopen.2024.8895
Question Can a large language model (LLM) accurately assess clinical acuity in the emergency department (ED)?
Findings This cross-sectional study of 251 401 adult ED visits investigated the potential for an LLM to classify acuity levels of patients in the ED based on the Emergency Severity Index across 10 000 patient pairs. The LLM demonstrated accuracy of 89% and was comparable with human physician classification in a 500-pair subsample.
Meaning These findings suggest that LLMs could accurately identify higher-acuity patient presentations when given pairs of presenting histories extracted from patients’ first ED documentation.
Abstract
Importance The introduction of large language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4; OpenAI), has generated significant interest in health care, yet studies evaluating their performance in a clinical setting are lacking. Determination of clinical acuity, a measure of a patient’s illness severity and level of required medical attention, is one of the foundational elements of medical reasoning in emergency medicine.
Objective To determine whether an LLM can accurately assess clinical acuity in the emergency department (ED).
Design, Setting, and Participants This cross-sectional study identified all adult ED visits from January 1, 2012, to January 17, 2023, at the University of California, San Francisco, with a documented Emergency Severity Index (ESI) acuity level (immediate, emergent, urgent, less urgent, or nonurgent) and with a corresponding ED physician note. A sample of 10 000 pairs of ED visits with nonequivalent ESI scores, balanced for each of the 10 possible pairs of 5 ESI scores, was selected at random.
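The balanced design can be illustrated with a short sketch. Five ESI levels yield 10 unordered pairs of distinct levels, so a balanced sample of 10 000 pairs implies 1000 pairs per level combination. The function name `balanced_pair_sample`, the `visits_by_esi` mapping, and the choice to draw visits with replacement are illustrative assumptions; the abstract does not describe the sampling procedure at this level of detail.

```python
import random
from itertools import combinations

# ESI levels: 1 = immediate, 2 = emergent, 3 = urgent, 4 = less urgent, 5 = nonurgent
ESI_LEVELS = [1, 2, 3, 4, 5]


def balanced_pair_sample(visits_by_esi, total_pairs, seed=0):
    """Draw visit pairs with nonequivalent ESI levels, balanced across level combinations.

    visits_by_esi is a hypothetical mapping {esi_level: [visit_id, ...]}.
    Visits are drawn with replacement here, which is an assumption.
    """
    rng = random.Random(seed)
    level_pairs = list(combinations(ESI_LEVELS, 2))  # 10 unordered pairs of distinct levels
    per_combination = total_pairs // len(level_pairs)
    sample = []
    for higher_acuity, lower_acuity in level_pairs:
        for _ in range(per_combination):
            visit_a = rng.choice(visits_by_esi[higher_acuity])
            visit_b = rng.choice(visits_by_esi[lower_acuity])
            sample.append(((higher_acuity, visit_a), (lower_acuity, visit_b)))
    return sample
```

With `total_pairs=10_000`, each of the 10 level combinations contributes 1000 pairs, and no pair contains two visits with the same ESI level.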
Exposure The potential of the LLM to classify acuity levels of patients in the ED based on the ESI across 10 000 patient pairs. Using deidentified clinical text, the LLM was queried to identify the patient with a higher-acuity presentation within each pair based on the patients’ clinical history. An earlier LLM was queried to allow comparison with this model.
Main Outcomes and Measures Accuracy score was calculated to evaluate the performance of both LLMs across the 10 000-pair sample. A 500-pair subsample was manually classified by a physician reviewer to compare performance between the LLMs and human classification.
Results From a total of 251 401 adult ED visits, a balanced sample of 10 000 patient pairs was created wherein each pair comprised patients with disparate ESI acuity scores. Across this sample, the LLM correctly inferred the patient with higher acuity for 8940 of 10 000 pairs (accuracy, 0.89 [95% CI, 0.89-0.90]). Performance of the comparator LLM (accuracy, 0.84 [95% CI, 0.83-0.84]) was below that of its successor. Among the 500-pair subsample that was also manually classified, LLM performance (accuracy, 0.88 [95% CI, 0.86-0.91]) was comparable with that of the physician reviewer (accuracy, 0.86 [95% CI, 0.83-0.89]).
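As a rough check on the headline figure, the accuracy and an approximate 95% CI can be recomputed from the reported counts. The sketch below uses a normal-approximation (Wald) interval, which is an assumption: the abstract does not state how the intervals were derived, and the function name `accuracy_with_ci` is hypothetical.

```python
import math


def accuracy_with_ci(correct, total, z=1.96):
    """Return (accuracy, lower, upper) using a normal-approximation interval.

    The Wald interval is an assumed method, not necessarily the one used in the study.
    """
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Reported counts for the full sample: 8940 correct of 10 000 pairs
accuracy, lower, upper = accuracy_with_ci(8940, 10_000)
print(f"accuracy={accuracy:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

For 8940 of 10 000 pairs, this approximation gives an accuracy of 0.894 with an interval that rounds to 0.89-0.90, consistent with the reported values.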
Conclusions and Relevance In this cross-sectional study of 10 000 pairs of ED visits, the LLM accurately identified the patient with higher acuity when given pairs of presenting histories extracted from patients’ first ED documentation. These findings suggest that the integration of an LLM into ED workflows could enhance triage processes while maintaining triage quality and warrants further investigation.