Abstract
Purpose:
Recently, the applications of artificial intelligence (AI) in medicine have proliferated with advancements in deployment, image interpretation, and assistance in clinical decision making. Launched on November 30, 2022, ChatGPT is an AI chatbot created by OpenAI that responds to users' questions with human-like, algorithm-driven responses. Here, we evaluated ChatGPT's performance in answering American Academy of Ophthalmology (AAO) "Diagnose This Case" challenges.
Methods:
Ophthalmic clinical material was collated from the 2022 AAO "Diagnose This Case" challenges. Cases were categorized by subspecialty, the relevant segment of the eye (anterior, posterior), and difficulty. Difficulty was approximated using the percentage of prior respondents who answered correctly. Each case's question and answer choices were then entered into ChatGPT. Because ChatGPT is unable to interpret images, a description of any relevant visual findings was provided where necessary; descriptions were derived from the image interpretations in the associated answer explanations. Outputs were recorded and compared with the AAO answers and reasoning, and ChatGPT's accuracy was compared across categories. Significance (p < 0.05) was assessed using Fisher's exact test, the chi-squared test, and Spearman's correlation coefficient.
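The analysis described above lends itself to a short script. Below is a minimal sketch, assuming one record per case containing its segment, difficulty tier, prior-respondent accuracy, and whether ChatGPT answered correctly; all names and values are illustrative placeholders, not the study's data.

```python
# Minimal sketch of the described analysis; all values below are
# hypothetical placeholders, not the study's actual data.
from scipy.stats import fisher_exact, chi2_contingency, spearmanr

# One record per case: (segment, difficulty tier,
#                       % of prior respondents correct, ChatGPT correct?)
cases = [
    ("posterior", "low", 82, True),
    ("anterior", "high", 38, False),
    ("posterior", "medium", 64, True),
    ("anterior", "low", 71, True),
    ("anterior", "medium", 55, False),
    ("posterior", "high", 45, False),
]

def correct_incorrect(category):
    """Count [correct, incorrect] ChatGPT answers within one category."""
    hits = [c for c in cases if c[0] == category or c[1] == category]
    right = sum(c[3] for c in hits)
    return [right, len(hits) - right]

# Segment comparison: 2x2 table -> Fisher's exact test (small counts).
_, p_segment = fisher_exact([correct_incorrect("posterior"),
                             correct_incorrect("anterior")])

# Difficulty comparison: 3x2 table -> chi-squared test.
chi2, p_difficulty, _, _ = chi2_contingency(
    [correct_incorrect(t) for t in ("low", "medium", "high")])

# Respondent accuracy vs. ChatGPT correctness -> Spearman's rho.
rho, p_rho = spearmanr([c[2] for c in cases],
                       [int(c[3]) for c in cases])
```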
Results:
ChatGPT was provided 51 clinical case challenges between December 22 and December 26, 2022. Overall, 56% of the outputs were correct, with better performance on lower-difficulty questions (χ² = 6.42, p = 0.04). There was a positive correlation between the percentage of prior respondents who chose the correct answer and ChatGPT answering correctly (r = 0.41, p = 0.004). Performance was slightly higher on posterior segment cases (63%; n = 24) than on anterior segment cases (48%; n = 27); however, this difference was not statistically significant (p = 0.40).
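For context on the segment comparison, the underlying counts can be approximated from the reported percentages. The reconstruction below is our own arithmetic (an assumption, since exact counts are not given in the abstract), not published data.

```python
# Rough consistency check: reconstruct correct-answer counts from the
# reported percentages (assumed, not published figures) and re-run
# Fisher's exact test on the resulting 2x2 table.
from scipy.stats import fisher_exact

posterior_correct = round(0.63 * 24)  # ~15 of 24 posterior cases
anterior_correct = round(0.48 * 27)   # ~13 of 27 anterior cases
table = [[posterior_correct, 24 - posterior_correct],
         [anterior_correct, 27 - anterior_correct]]
_, p_value = fisher_exact(table)
print(p_value)  # expected to be non-significant at this sample size
```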
Conclusions:
Our study demonstrated a tendency for ChatGPT to select the same answers chosen by human respondents in the ophthalmology case challenges. This trend is consistent with the AI's intended design to generate human-like responses. Higher performance on posterior segment cases may reflect the initial focus of AI research on posterior segment disease, although this finding is limited by the sample size. Further research with expanded datasets could provide insight into performance across subspecialties.
This abstract was presented at the 2023 ARVO Annual Meeting, held in New Orleans, LA, April 23-27, 2023.