Abstract
Purpose:
Patients are seeking out medical advice from publicly available learning models, yet the medical advice these models provide has not been fully evaluated. In this study we evaluate the medical advice of four learning models and assess physician and non-physician perceptions of that advice.
Methods:
In August 2023, 11 common eye symptoms from the "For Public & Patient" section of the American Academy of Ophthalmology website were entered into four publicly available learning models (ChatGPT-3.5, ChatGPT-4.0, Bing, and Bard). Follow-up questions addressed etiology, therapy, and follow-up with a physician. The answers were anonymized and graded by 6 evaluators (2 ophthalmologists, 2 emergency medicine physicians, and 2 non-medical persons with graduate degrees) on a 5-point Likert scale in three categories: accuracy, helpfulness, and specificity. The results were analyzed to determine the quality of response from each model.
Results:
Average ratings for each learning model by each grader are shown in Figure 1. Overall, ChatGPT-3.5 was rated highest by all 6 graders and had the highest average rating, while Bard had the lowest average rating. ChatGPT-3.5 was selected by the evaluators as having the best response for 6 of the 11 questions; ChatGPT-4.0, Bing, and Bard were each selected as having the best response to 2 of the questions.
Spearman's correlation was 0.51 between the two non-medical persons, 0.49 between the two emergency medicine physicians, and 0.63 between the two ophthalmologists (Figure 2). ANOVA of the responses between non-ophthalmologists was significant (p < 0.005), but between physicians it was not significant (p > 0.05).
Conclusions:
Overall, ChatGPT-3.5 currently tends to provide the best advice. While there was moderate to strong correlation within each group in how the graders perceived the advice, non-medical persons evaluated the advice differently than physicians did.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.