
EDITORIAL

Large Language Models as Item Writer: Innovation or Limitation? Are We Ready to Hand Over the Pen?

Abdullah Alismail

Chest 2025; 168: 1286-1287

The use of large language models and artificial intelligence (AI) in medical education has increased recently, not only among students but also among faculty. Students use these tools to enhance their learning and comprehension of a subject, whereas faculty cite other reasons, such as help with curriculum and course design.1,2 Studies encourage faculty to stay up to date with these tools, because change is happening at a rapid pace, while also approaching their use with caution.3,4 In this issue of CHEST, Safadi et al5 report a study titled “Quality of Human Expert vs Large Language Model-Generated Multiple-Choice Questions in the Field of Mechanical Ventilation.”

Safadi et al5 conducted a noninferiority cross-sectional study comparing multiple-choice questions (MCQs) written by human experts with AI-generated MCQs. The topic used to generate the MCQs was mechanical ventilation. The authors took an interesting approach to answering the following question: Are MCQs written by AI noninferior to those written by human experts?

The authors followed an evidence-based approach to writing examination questions, first establishing and identifying the learning objectives and outcomes (Table 1).5 To accomplish this task, they identified the following subtopics of mechanical ventilation: the equation of motion, Ohm’s law, and tau (the time constant) and auto-PEEP.

These objectives were then given to the 2 groups: the human experts and the AI tool. First, the human expert group was asked to write a total of 15 MCQs; the AI was then asked to do the same. For the AI arm, the authors followed specific steps, using prompting to direct the tool.5 Prompting is an important component of working with AI tools and has received recent attention in the literature.6 In this study, the authors ensured that all agreed on a prompt template to guide the AI tool appropriately. The prompt template included the learning objectives and identified the audience, fellows (physicians) in critical care medicine; the next prompt asked the model to generate a question with 5 answer choices and to identify the correct answer.
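
The authors’ actual Python code is available in the study supplement. As a rough illustration only, the sketch below shows what such a prompting step can look like with an OpenAI-style chat API: the template wording, function names, and default model are assumptions for illustration, not the authors’ exact prompt or code.

```python
# Hypothetical sketch of an MCQ-generation prompt, modeled loosely on the
# workflow described by Safadi et al; the template wording and function
# names are illustrative assumptions, not the authors' actual code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = """You are writing a board-style examination item.
Audience: fellows (physicians) in critical care medicine.
Learning objective: {objective}

Write one multiple-choice question on this objective with exactly five
answer choices labeled A-E, then state which choice is correct."""

def generate_mcq(objective: str, model: str = "o1-preview") -> str:
    """Ask the model for a single five-option MCQ on the given objective."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(objective=objective)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # One of the mechanical ventilation subtopics used in the study
    print(generate_mcq("Apply the equation of motion to interpret ventilator waveforms."))
```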

Safadi et al5 asked 31 expert physician educators in the topic to respond to and evaluate the quality of the MCQs produced by the 2 methods.5 The findings showed that the approaches were comparable, with no meaningful difference between the AI-written and human expert-written questions (mean final score, –0.23 ± 0.56; P = .67). An interesting finding in this study was the time needed to write the questions: the human experts took approximately 10 hours, compared with only 1 hour for the AI, including prompt creation and refinement.5 These findings are similar to those of a study by Cheung et al,7 which also reported a shorter time with AI than with humans (20 minutes vs 211 minutes).7 Finding time to write MCQs for a course as a clinician educator is a challenge, in addition to the special training needed to master and achieve competence in MCQ writing. Furthermore, Safadi and colleagues used a modified Delphi process to set the noninferiority margin at 15% of the maximum score.5 Although their findings showed the two methods to be comparable, the literature has shown mixed findings regarding the complexity, distractors, and depth of questions generated by AI compared with humans.8,9
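
To make the margin logic concrete, the sketch below shows how a noninferiority check of this general shape works, assuming a 5-point maximum quality score (so a 15% margin is 0.75 points) and an invented confidence bound; the scale, direction convention, and example numbers are assumptions for illustration, not the study’s exact analysis.

```python
# Minimal sketch of a noninferiority check against a margin set at 15% of
# the maximum score; the 5-point scale and the confidence bound below are
# assumptions for illustration, not values reported by Safadi et al.
MAX_SCORE = 5.0
MARGIN = 0.15 * MAX_SCORE  # 0.75 points

def is_noninferior(ci_lower: float, margin: float = MARGIN) -> bool:
    """AI is noninferior if the lower confidence bound of the
    (AI - human) mean quality difference stays above -margin."""
    return ci_lower > -margin

# Example: a mean difference of -0.23 with a hypothetical 95% CI lower bound
print(is_noninferior(ci_lower=-0.51))  # True: the bound stays within the margin
```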

Another outcome the authors evaluated was respondents’ ability to identify whether a question was written by AI or by a human. The findings showed that respondents judged approximately 55% of the AI-generated questions vs 51.8% of the human-written questions to be AI-written. Statistically, there was no significant difference, and the method used to write a question could not be predicted from the question itself.5
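
As a purely illustrative sketch of how such an identification comparison can be tested, the snippet below runs a two-proportion z-test on counts invented to roughly match the reported 55% vs 51.8% rates (31 raters × 15 items per group); these counts are not the study’s raw data, and the study’s actual statistical method may differ.

```python
# Hypothetical two-proportion comparison of "judged to be AI-written" rates;
# the counts below are invented to approximate the reported 55% vs 51.8%
# and are NOT the study's raw data.
from statsmodels.stats.proportion import proportions_ztest

flagged_as_ai = [256, 241]   # AI-written vs human-written items judged "AI"
total_ratings = [465, 465]   # e.g., 31 raters x 15 items per group
stat, p_value = proportions_ztest(flagged_as_ai, total_ratings)
print(f"z = {stat:.2f}, p = {p_value:.3f}")  # a large p suggests no detectable difference
```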

Although the use of AI might be intimidating to some educators, it is vital for clinicians and academic educators to catch the train before it leaves and to embrace these new tools, resources, and advancements. To that end, combining several key skills is recommended: (1) competence in writing objectives and test questions; (2) choosing the best AI tool and model for high-level reasoning (eg, ChatGPT 3.5 vs GPT 5.0; Safadi et al5 used the o1-preview 2024-09-12 model); and (3) competence in AI prompt writing.6,10 In addition, a quality assurance step in which an expert validates the output of the AI tool will help ensure the validity and accuracy of the questions. These innovations could free faculty workload time for other areas, such as innovation, scholarship, pedagogy, and assessment.11

Key concerns raised in the recent AI literature include the inability to form high-level questions, flawed reasoning, ethical issues, and AI hallucination. To limit these issues, the use of proper prompts, as mentioned above and similar to the process used by Safadi and colleagues,5 is an excellent starting point. The authors have provided the Python code used in this study as a supplement for future researchers’ use.

The future use of AI and large language models in medical education raises many questions for faculty that need answering. For example, as AI use advances, how will it change faculty workload? Should faculty development courses or workshops include AI prompt training for writing MCQs? And if workload were reduced, would scholarship and innovation increase, or clinical time? These are example questions for future researchers to address in answering the question: Are we ready to hand over the pen?
