Comparing Generative Artificial Intelligence vs Experts for Detection of Catheter-Associated Urinary Tract Infection

Shatha Alshanqeeti, Karen Coffey, Karrie Mcdermott, et al

Clinical Infectious Diseases, Volume 82, Issue 1, 15 January 2026, Pages 148–150, https://doi.org/10.1093/cid/ciaf486

Abstract

A retrospective cohort study compared generative artificial intelligence (GenAI) versus infection control expert for catheter-associated urinary tract infection (CAUTI) detection. Sensitivity was 95.2% (20/21), and specificity was 76.2% (16/21), improving to 90% with expert review of GenAI output. GenAI is a promising tool to assist in CAUTI surveillance.

Catheter-associated urinary tract infections (CAUTI) are the most commonly reported healthcare-associated infection (HAI) and are associated with increased morbidity, mortality, and healthcare costs [1]. CAUTI surveillance and reporting require subjective and resource-intensive manual chart review [2]. Experimental evaluations of generative artificial intelligence (GenAI) with simulated CAUTI cases to assist in HAI surveillance demonstrate promise [2–4]. To the best of our knowledge, there are no studies evaluating performance of GenAI for detecting CAUTI using clinical data. We performed a retrospective matched cohort study at a single medical center evaluating GenAI compared to human experts for CAUTI detection.

METHODS

This study was conducted across all acute care units at the Veteran's Affairs Maryland Health Care System (VAMHCS), a 124-bed acute care hospital in Baltimore, Maryland. A total of 231 positive urine cultures were identified in surveillance records from 2021 to 2024. This included urine cultures from patients with and without urinary catheters. Each positive urine culture was considered a distinct event, and each single patient could have multiple events if they had multiple positive urine cultures during the study period. All 231 cultures were previously reviewed by the infection control team, initial review was performed by infection control (IC) nurses followed by confirmation by a hospital epidemiology physician. A total of 21 (9%) cases were deemed CAUTI and the incidence of CAUTI during these years was above national average. All 21 reported CAUTIs were identified and 21 positive urine cultures deemed not CAUTI were randomly selected from the surveillance records. A single continuous document in the “docx.” format was assigned to each case, patient data was extracted from the electronic health record and added to this document. Data extracted included clinical notes, as well as all vital signs and microbiological results during admission. Clinical notes were limited to specific notes due to limits to the amount of data that the secure Chat GPT-4o could process and included the following: the initial history and physical exam note; the primary physician and nursing progress notes during the infection window period from day −3 to day 3; and the discharge summary. GenAI prompts were developed based on previously described prompts [5] and Center for Disease Control (CDC) National Health and Safety Network (NHSN) definitions [6] in a decision rule algorithm (Supplementary Appendix A). The OpenAI GPT 4o model was accessed through the Office of Digital Health GPT(beta) pilot. All work was done on the secure VA OpenAI GPT 4o model, a model that is fully protected and does not retain patient data. Temperature was set to 0 in our study (temperature measures the randomness of LLM output and ranges from 0 to 2 with 0 indicating limited randomness and higher temperatures resulting in increased creativity of output [7]). Prompts were entered into VA GPT (Beta) chat box followed by the patient data; a new window was used for each patient. Time required for the process was estimated by the physician performing all reviews (author S.A.). This was a quality improvement assessment and deemed non-human subjects research by the University of Maryland Institutional Review Board (IRB) and the VA Research & Development committee.

RESULTS

All 42 patients had charts with necessary records available for review. Among these patients, 92.8% (39/42) were male (20/21 of CAUTI group and 19/21 of not-CAUTI group). Median age was 72 (interquartile range [IQR] 67.5–79.5) in CAUTI group and 73 (IQR 66–77.5) in not-CAUTI group. Average LOS was 29.5 days (SD 63.7) in CAUTI group and 18.3 days (SD 8.2) in not-CAUTI group. Among all patients included, overall agreement between expert review and AI review was 85.7% (36/42). Sensitivity for CAUTI was 95.2% (20/21) (95% CI 76.2–99.9%) and specificity was 76.2% (16/21) (95% confidence interval [CI]: 52.8–91.8%). One CAUTI was missed by GenAI due to failure to recognize a qualifying symptom (Table 1). Four not-CAUTI cases were identified as CAUTI (false-positives) by GenAI because of assuming the presence of a catheter or including non-qualifying symptoms (Table 1). Human review of GenAI output identified 2 of 4 false positive errors made by GenAI. Additionally, a true CAUTI missed by initial expert review and correctly identified by GenAI was retrospectively identified during human review of GenAI output. Therefore, specificity could be improved to 90% (18/20) with expert human review of AI output. Human review of GenAI output required <5 minutes. Examples of GenAI output are provided in Supplementary Appendix A. The time spent identifying clinical data, copying and pasting, and feeding data into GenAI ranged between 10 and 15 minutes per case.

Table 1. Accuracy of GenAI for Identifying CAUTI or Not-CAUTI Compared to Expert Human Review

Sensitivity	Error of Missing CAUTI (false-negative)		GenAI Output	Quote From Medical Record	Error Identifiable From GenAI Output
95.2% (20/21)	GenAI missed qualifying symptoms: Case A: abdominal and flank pain		“Does not indicate the presence of fever, suprapubic tenderness, or costovertebral angle tenderness”	“Lower abdomen/L flank tenderness Lower abdomen/L flank tenderness noted on exam.”	No
Specificity	Errors of wrongly identifying CAUTI (false-positives)		GenAI output	Quote from medical record	Error identifiable from GenAI output
76.2% (16/21)	Wrong assumption of a urinary catheter	Case A	“Yes, the patient had an indwelling urinary catheter.”	“Attempted to place Foley, but patient was not tolerating at bedside”	No
	Wrong assumption of a urinary catheter	Case B	“Does the patient have a urinary catheter? Yes.”	“Lines, Tubes, & Drains: PIV, Foley removed”	No
	Included non-qualifying symptoms:	Case C	“The patient's clinical notes describe dysuria and pus from the catheter”^a	“Had pus/drainage from Foley” “pt c/o dysuria/pain”	Yes
	Included non-qualifying symptoms:	Case D	“feeling hot and a core temperature of 37.9°C, complaints of blood-tinged sputum and N/V.”	“Reports feeling warm overnight, though no fevers”	Yes
	Initial human review missed a true CAUTI^b	Case E: Missed symptoms (suprapubic tenderness)	“Documented suprapubic tenderness and initiation of antibiotic treatment.”	“New suprapubic pain”	Yes

Abbreviations: CAUTI, catheter-associated urinary tract infection; GenAI, generative artificial intelligence; PIV, peripheral Intravenous.

^aDysuria is a non-qualifying symptom when a catheter is present.

^bGenAI made the correct determination identifying a CAUTI accurately but for this study, it was initially considered a false positive compared to the reference standard.

DISCUSSION

In a single hospital matched cohort study reviewing all CAUTIs for 4 years, we found GenAI identified nearly all CAUTIs and effectively excluded the majority of patients with positive urine cultures without CAUTI. Notably, the reference standard expert CAUTI review was incorrect 1 time that was identified by GenAI, and most errors of GenAI could be identified by an expert human reviewer based on GenAI output. CAUTI is an important HAI that must be reported by US hospitals and is tied to reimbursement. HAI surveillance and reporting generally requires a substantial portion of expert IC nurse time, up to 61% of all activities [8]. HAI reporting distracted from other activities for nearly all IC nurses [9]. The amount of time spent by IC expert for a manual review and surveillance is variable but can reach up to half the working hours [2]. This can potentially be reduced as the time spent to review one case using GenAI ranged between 10 and 15 minutes. The potential to automate or semi-automate reporting could improve efficiency of expert IC nurses. Our findings suggest that GenAI is best used to augment the performance of expert reviewers but not replace human review. Furthermore, it may improve equity in resource-limited settings that have fewer highly trained reviewers.

We found GenAI was able to effectively extract key elements for CAUTI determination from clinical data and present these as output. This is similar to earlier studies using simulated cases [2,3]. Notably, GenAI was highly reliable at providing the necessary criteria for review but sometimes incorrectly reasoned that a CAUTI was present, in spite of those criteria. This emphasizes the importance of expert review of GenAI to review output and confirm CAUTI. Assuming human experts reviewed the output from GenAI to apply rules, GenAI augmented by human review could achieve 95.5% sensitivity and 90.5% specificity (from 3 cases with not-CAUTI determination evident from output but wrong by GenAI global assessment). This implies that expert humans in the loop are important for clinical implementation and GenAI should be designed that maximizes transparency in providing criteria for CAUTI. In addition, false positive errors which could not be identified with auditing of GenAI output were due to incorrect identification of catheter presence. It is important to note that in the electronic medical records used for this study, a designated device flowsheet section was not available, and the presence or absence of a catheter was determined using progress notes. There is potential to further improve specificity in electronic health records with a device flowsheet.

The full impact of use of GenAI to assist in CAUTI determination must consider the low prevalence of CAUTI in patients with positive urine cultures. In this hospital from 2021 to 2024, 9% of cases were deemed CAUTI. Therefore, within this hospital the positive-predictive value of a GenAI CAUTI determination would be 28.4% and the negative-predictive value of a GenAI not-CAUTI determination would be 99.4%. The time required for review in this study was 10–15 minutes, which, although it compares favorably with traditional human review, could be much faster with automation of GenAI.

Our study had limitations including being single site with a small sample size and a predominantly male population, using a single GenAI model (OpenAI-Chat GPT-4o) due to lack of other accessible HIPAA and privacy-compliant GenAI systems, requiring cutting and pasting of data and prompts into a chatbot to approximate what automation could achieve, and using unblinded human review to determine necessary elements for CAUTI or not CAUTI from output.

CONCLUSION

Generative AI was generally accurate at identifying CAUTI in hospitalized patients. Automated GenAI assisted human review could improve CAUTI metric identification and reporting.