Research Letter
March 18, 2024
Comparative Analysis of Multimodal Large Language Model Performance on Clinical Vignette Questions
Tianyu Han, Lisa C. Adams, Keno K. Bressem, et al
JAMA. Published online March 18, 2024. doi:10.1001/jama.2023.27861
Large language models (LLMs) have attracted attention for their ability to process text and other complex content.1,2 In medicine, an LLM-based advisory chat agent could assist in research, education, and clinical care.3
One prominent model is GPT-4 (OpenAI), released in March 2023. GPT-4’s initial limitation to text-only input posed a challenge for medical applications, which often rely on visual data. GPT-4V(ision), a multimodal extension of GPT-4 that can process visual input in addition to text, was released in October 2023. In December 2023, Google announced the release of its own multimodal model, Gemini Pro, as a competitor to GPT-4V. Although GPT-4 and Gemini Pro are proprietary, several open-source models have been developed that do not require sending data to third parties. We evaluated the performance of GPT-4V and Gemini Pro on a series of multimodal clinical vignette quiz questions with images from medical journals and compared them with language-only models, including open-source alternatives.
Methods
We used Clinical Challenges from JAMA and Image Challenges from the New England Journal of Medicine (NEJM) published between January 1, 2017, and August 31, 2023, to assess the diagnostic accuracy of GPT-4V, Gemini Pro, and 4 language-only models: GPT-4, GPT-3.5, and 2 open-source models, Llama 2 (Meta) and Med42. Med42 is a variant of Llama 2 adapted for medical applications (eTable in Supplement 1).
Both JAMA and NEJM cases contain a case description, a medical image, and a question (“What would you do next?” for JAMA; “What is the diagnosis?” for NEJM), with 4 (JAMA) or 5 (NEJM) answer choices. JAMA cases posing a different question were excluded. For NEJM challenge items, we extracted statistics on NEJM subscriber responses and stratified question difficulty into 4 equal intervals by the percentage of correct human responses.
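As an illustration of this stratification step (not the authors' code), the following minimal Python sketch bins items into 4 equal-width difficulty intervals by the percentage of correct human responses; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical NEJM item table; "pct_human_correct" is the percentage of
# subscriber responses that chose the correct answer (column name assumed).
items = pd.DataFrame({
    "question_id": [101, 102, 103, 104],
    "pct_human_correct": [22.0, 48.5, 63.0, 91.0],
})

# Four equal-width intervals over 0%-100% of correct human responses,
# from hardest (fewest correct human responses) to easiest.
bins = [0, 25, 50, 75, 100]
labels = ["0%-25%", ">25%-50%", ">50%-75%", ">75%-100%"]
items["difficulty"] = pd.cut(
    items["pct_human_correct"], bins=bins, labels=labels, include_lowest=True
)

print(items)
```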
Case descriptions and answer choices were provided to the models, which were asked to select the correct answer. For GPT-4V and Gemini Pro, the images were provided along with the case descriptions. Statistical analyses were conducted with Python, version 3.7.10; 95% CIs were calculated from binomial distributions, and LLM accuracies were compared with 2-tailed t tests, with P < .05 considered statistically significant.
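A minimal sketch of how such an analysis could be implemented (not the authors' code) is shown below. The per-question correctness values are placeholders, the exact (Clopper-Pearson) interval is one common binomial-based choice of CI, and because the text does not specify whether the t test was paired or unpaired, an unpaired test is shown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder per-question correctness indicators (1 = correct answer chosen);
# real values would come from scoring each model's answers against the key.
model_a = rng.integers(0, 2, size=140)
model_b = rng.integers(0, 2, size=140)

def binomial_ci(correct, total, alpha=0.05):
    """Exact (Clopper-Pearson) CI for a binomial proportion."""
    lower = stats.beta.ppf(alpha / 2, correct, total - correct + 1) if correct > 0 else 0.0
    upper = stats.beta.ppf(1 - alpha / 2, correct + 1, total - correct) if correct < total else 1.0
    return lower, upper

accuracy = model_a.mean()
lower, upper = binomial_ci(int(model_a.sum()), model_a.size)
print(f"Accuracy {accuracy:.1%} (95% CI, {lower:.1%}-{upper:.1%})")

# 2-tailed t test comparing the two models' per-question correctness.
t_stat, p_value = stats.ttest_ind(model_a, model_b)
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
```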
Results
On 140 JAMA questions, GPT-4V consistently achieved the highest accuracy, with 73.3% (95% CI, 66.3%-80.9%) vs 55.7% (95% CI, 47.5%-63.9%) for Gemini Pro, 63.6% (95% CI, 55.6%-71.5%) for GPT-4, 50.7% (95% CI, 42.4%-59.0%) for GPT-3.5, 53.6% (95% CI, 45.3%-61.8%) for Med42, and 41.4% (95% CI, 33.3%-49.6%) for Llama 2 (all pairwise comparisons P < .001). Results were similar for the 348 NEJM questions, with GPT-4V and Gemini Pro correctly answering 88.7% (95% CI, 85.5%-92.1%) and 68.7% (95% CI, 63.8%-73.6%), respectively (Figure 1). Although Med42 outperformed GPT-3.5 on the JAMA challenge, it was inferior to GPT-3.5 on the NEJM challenge (59.9% [95% CI, 54.6%-64.9%] vs 61.7% [95% CI, 56.7%-66.9%]; P < .001). For the NEJM challenges, human readers correctly answered 51.4% (95% CI, 46.2%-56.7%) of questions.
When stratified by question difficulty, the NEJM results were similar for the highest 3 difficulty levels, but GPT-4V, Gemini Pro, GPT-4, and Med42 all had 100% accuracy for the lowest-difficulty questions (Figure 2).


Discussion
In both the JAMA Clinical Challenge and the NEJM Image Challenge databases, GPT-4V demonstrated significantly better accuracy than its unimodal predecessors GPT-4 and GPT-3.5, than Gemini Pro, and than the open-source models Llama 2 and Med42, suggesting that GPT-4V can interpret medical images even without dedicated fine-tuning. Although these findings are promising, caution is warranted because the questions were curated vignettes from medical journals and do not fully represent the medical decision-making skills required in clinical practice.4 The integration of artificial intelligence models must consider their role in different clinical scenarios and the broader ethical implications.
This study has limitations. First, the training data of the proprietary models may have included the cases used in this study. Second, the clinical challenges do not simulate clinical practice because physicians would not provide multiple-choice answer options to a model to obtain help with a diagnosis. Third, the study does not include all available multimodal LLMs.