A Radiological Turing Test: Can Large Language Models Answer Board-Style Questions?

Saturday, May 6, 2023

First Author(s)

Thomas P. Reith, MD

Resident Physician
University of Iowa
Coralville, Iowa, United States

Co-Author(s)

Michael P. D'Alessandro, MD

Professor of Radiology
University of Iowa, United States

Purpose: For an artificial intelligence (AI) system to replace a radiologist, it must first pass the Qualifying (Core) Exam administered by the American Board of Radiology (ABR) as well as the USMLE Step 3 exam. ChatGPT is a powerful, recently released large language model that has generated considerable interest for its ability to provide detailed responses across a wide variety of subjects. This study assesses ChatGPT’s performance on board-style questions similar to those found on the ABR Core and USMLE Step 3 exams.

Methods/Materials: Sample Core Exam questions were collected from publicly available ABR materials, the Aunt Minnie Board Review tool, and the RadPrimer question bank. Sample USMLE Step 3 questions were collected separately from the USMLE website. Questions containing visual media were discarded, and the remainder were submitted verbatim to ChatGPT after a prompt instructing the software to pretend to be a radiologist or a medical professional taking a standardized exam. A new chat session was started for each question, except within related question sets.

Results: After exclusions, 127 radiology questions and 124 Step 3 questions were submitted to ChatGPT. ChatGPT answered 44.1% of radiology questions and 62.9% of Step 3 questions correctly. Its superior performance on Step 3 questions was statistically significant: χ²(1, N = 251) = 8.9, p = 0.0028.

Since the typical Step 3 passing threshold is approximately 60% correct, ChatGPT’s performance was around passing level. While no publicly released data exist on the percentage of correct answers required to pass the ABR Core Exam, ChatGPT’s significantly inferior performance on radiology questions suggests it is not at passing level.
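The chi-square comparison reported in the Results can be checked with a short calculation. The sketch below is illustrative only and is not the authors' analysis code. It assumes the correct-answer counts were 56 of 127 and 78 of 124, which are reconstructed here from the reported percentages rather than stated in the abstract, and it assumes a Pearson chi-square test without continuity correction. The function names are hypothetical.

```python
import math

# Assumption: counts reconstructed from the reported percentages,
# 44.1% of 127 ~ 56 correct and 62.9% of 124 ~ 78 correct.
radiology_correct, radiology_total = 56, 127
step3_correct, step3_total = 78, 124


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def chi2_p_value_df1(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.
    For df = 1 the survival function equals erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(statistic / 2))


# Rows: exam type; columns: correct, incorrect.
chi2 = pearson_chi2_2x2(
    radiology_correct, radiology_total - radiology_correct,
    step3_correct, step3_total - step3_correct,
)
p_value = chi2_p_value_df1(chi2)
n_total = radiology_total + step3_total

print(f"chi2(1, N = {n_total}) = {chi2:.1f}, p = {p_value:.4f}")
```

Under these assumptions the statistic and p-value round to the figures reported in the Results; applying a continuity correction would give a somewhat smaller statistic.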

Conclusions: ChatGPT performed at passing level on USMLE Step 3 questions, but fared significantly worse on radiology board-style questions. These results suggest that large language models are increasingly adept at answering general questions involving clinical medicine, but may struggle in more specialized areas like radiology. Nevertheless, as the ever-increasing competence of artificial intelligence transforms the medical field, the radiology community should keep abreast of new developments.