Abstract
Purpose:
Chat Generative Pre-trained Transformer (ChatGPT) is an artificial intelligence chatbot designed to solve problems by drawing on stored knowledge. As patients increasingly use online resources to investigate their symptoms, we sought to assess how accurately ChatGPT answers board-style ophthalmology questions and diagnoses acute ocular problems.
Methods:
The American Academy of Ophthalmology question bank was used to provide 15 questions from each of 11 subspecialty categories. Feedback was given for incorrect answers, and the question was re-asked. Additionally, charts from the electronic medical record (EMR) were accessed over a one-month period for patients presenting to the triage clinic with a new, acute ocular problem. The chief complaint and elements of the ophthalmic exam were entered into ChatGPT to generate a response. ChatGPT’s primary and differential diagnoses were checked against the ophthalmologist’s diagnosis, and the numbers of correct and incorrect responses were calculated for each patient encounter. Data were analyzed using Fisher’s exact test and pairwise comparisons.
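For illustration only, the comparison of observed accuracy against chance and the between-category comparisons described above could be sketched as follows. The counts, the assumption of four answer choices (25% chance accuracy), and the use of SciPy are assumptions for this sketch, not the authors' actual analysis code.

```python
# Minimal sketch (not the study's code): compare observed accuracy with chance
# and compare two subspecialty categories with Fisher's exact test.
from scipy.stats import binomtest, fisher_exact

# Hypothetical counts: 97 correct out of 165 first-attempt answers (~59%).
n_questions = 165
n_correct = 97

# Assumption: four answer choices per question, so chance accuracy = 0.25.
result = binomtest(n_correct, n_questions, p=0.25, alternative="greater")
print(f"accuracy = {n_correct / n_questions:.2f}, p = {result.pvalue:.3g}")

# Fisher's exact test on a hypothetical 2x2 table of correct/incorrect counts
# for two subspecialties (e.g., uveitis vs. optics).
table = [[11, 4],   # uveitis: correct, incorrect
         [6, 9]]    # optics:  correct, incorrect
odds_ratio, p_value = fisher_exact(table)
print(f"Fisher's exact p = {p_value:.3f}")
```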
Results:
The overall accuracy of ChatGPT’s first-attempt answers to board-style questions was 59% (95% CI, 0.51-0.67), significantly higher than chance (p < 0.001). Accuracy varied from 40% (optics) to 73% (uveitis), with every category except optics (p = 0.229) performing significantly better than chance. The overall accuracy of second-attempt answers was 24% (95% CI, 0.14-0.36), and first- and second-attempt accuracies were inversely correlated (Pearson correlation coefficient = -0.93). Combining all attempts, the average accuracy was 69% (95% CI, 0.61-0.76). Regarding EMR diagnoses, ChatGPT’s primary diagnosis matched the diagnosis given by the physician in clinic 67% of the time, and its differential included the correct diagnosis 88% of the time. There was no difference in accuracy among specialties (neuro-ophthalmology, oculoplastics, cornea, retina, uveitis) by Fisher’s exact test (all p > 0.79).
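As a hedged sketch of how the reported confidence intervals and correlation could be computed (the values below are placeholders, not study data; scipy.stats.binomtest and pearsonr are the assumed tools):

```python
# Illustrative sketch: exact 95% CI for a proportion and Pearson correlation
# between per-category accuracies on the first and second attempts.
from scipy.stats import binomtest, pearsonr

# 95% Clopper-Pearson interval for, e.g., 97/165 correct (~59%).
ci = binomtest(97, 165).proportion_ci(confidence_level=0.95, method="exact")
print(f"95% CI: {ci.low:.2f}-{ci.high:.2f}")

# Pearson correlation between hypothetical first- and second-attempt
# accuracies across subspecialty categories.
first_attempt  = [0.40, 0.53, 0.60, 0.67, 0.73]
second_attempt = [0.44, 0.35, 0.30, 0.20, 0.15]
r, p = pearsonr(first_attempt, second_attempt)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```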
Conclusions:
ChatGPT answers a slight majority of board-style questions correctly but does not improve with feedback. For clinical patients in the EMR, ChatGPT is likely to include the correct diagnosis within its differential, although this likely depends on relevant history and exam elements being provided to ChatGPT.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.