Abstract
Purpose:
ChatGPT, a large language model-based chatbot, is an ever-evolving source of medical information that has garnered wide interest and usage in recent years. Consequently, it is necessary to evaluate the correctness and readability of the information it provides. Age-related macular degeneration (AMD) is a leading cause of severe vision loss in the United States. Our study aims to assess the appropriateness and readability of responses generated by ChatGPT-3.5 regarding AMD.
Methods:
A retrospective cross-sectional study was performed. A list of 40 frequently asked questions about AMD was curated and categorized into general knowledge, such as prevalence and visual impact (n = 14); diagnostic and screening guidelines (n = 10); and treatment options (n = 16). Each question was entered into the ChatGPT-3.5 platform, and two independent ophthalmologists graded the appropriateness of each response. Readability was assessed with the Flesch-Kincaid Grade Level, Flesch Reading Ease, and Gunning Fog Index, generated using an online readability tool.
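As an illustration of how such scores can be reproduced, the minimal sketch below uses the open-source Python package textstat to compute the three indices for a sample response; this is an assumption for illustration only, as the study itself used an unnamed online readability tool, and the sample text is hypothetical.

```python
# Illustrative sketch only: the study used an online readability tool;
# the open-source "textstat" package (an assumed substitute, not the
# authors' tool) computes the same three indices for a sample response.
import textstat

# Hypothetical ChatGPT-style response text, for demonstration only
response = (
    "Age-related macular degeneration (AMD) is a progressive eye disease "
    "that damages the macula, the part of the retina responsible for "
    "sharp central vision."
)

# Flesch-Kincaid Grade Level: U.S. school grade needed to understand the text
fk_grade = textstat.flesch_kincaid_grade(response)

# Flesch Reading Ease: 0-100 scale; higher scores indicate easier reading
reading_ease = textstat.flesch_reading_ease(response)

# Gunning Fog Index: estimated years of formal education needed on first reading
fog = textstat.gunning_fog(response)

print(f"Flesch-Kincaid Grade Level: {fk_grade:.1f}")
print(f"Flesch Reading Ease:        {reading_ease:.1f}")
print(f"Gunning Fog Index:          {fog:.1f}")
```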
Results:
Responses were appropriate for 85.7% (12/14) of the general knowledge questions, 100% (10/10) of the screening questions, and 62.5% (10/16) of the treatment questions. Among the treatment questions, responses were incomplete for 31.3% (5/16) and inappropriate for 6.2% (1/16). Overall, ChatGPT-3.5 generated appropriate responses 80.0% (32/40) of the time, incomplete responses 17.5% (7/40) of the time, and inappropriate responses 2.5% (1/40) of the time. The average Flesch-Kincaid Grade Level, Flesch Reading Ease, and Gunning Fog Index were 12.0 ± 1.8, 39.9 ± 7.0, and 14.4 ± 2.3 for general knowledge responses; 12.7 ± 1.0, 31.3 ± 6.2, and 15.2 ± 1.0 for screening responses; and 13.3 ± 1.1, 27.2 ± 7.6, and 15.3 ± 1.3 for treatment responses.
Conclusions:
Overall, ChatGPT-3.5 responses were consistently appropriate for questions about AMD general knowledge and screening but underperformed on questions about treatment options. When this study was conducted, ChatGPT-3.5 had been trained only on information available through 2022; important updates and advances in treatment were therefore excluded (e.g., biosimilars and dry AMD treatments). The readability of the responses was at the level of a high-school graduate or higher, whereas health-related information is recommended to be written at a fifth- to sixth-grade reading level. Understanding the current limitations of AI-guided responses in the context of health-related counseling and information sourcing is critical.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.