Abstract
Purpose:
To assess the effectiveness of ophthalmology-focused large language models (LLMs), we used GPT-4
as an evaluator, applying a rubric tailored to ophthalmic care. The goal is to measure GPT-4's alignment
with clinicians' assessments, with a view toward automating clinical validation of ophthalmology AI
applications. This study helps establish both the clinical accuracy of LLMs and GPT-4's reliability as an
evaluator in ophthalmology.
Methods:
We created a dataset of 368 ophthalmology question-answer (Q&A) pairs covering diseases such as
cataract, diabetic retinopathy, glaucoma, and refractive errors, initially drafted with ChatGPT and
subsequently refined by ophthalmologists. We fine-tuned five LLMs on this dataset, including versions
of LLAMA2 and GPT-3.5. Their performance was assessed on 20 held-out Q&A pairs and graded by
GPT-4 using a scoring rubric. Additionally, three clinicians of varying experience levels ranked the
LLMs' answers. To gauge clinical alignment, we compared these rankings with GPT-4's scores using
Spearman correlation coefficients and Cohen's Kappa.
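The two agreement statistics used here can be sketched in plain Python. The scores and rankings below are hypothetical toy values for illustration only, not the study's data, and the helper functions are simplified (no tie handling):

```python
def rank(values, descending=False):
    """Return 1-based ranks for a list of values (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=descending)
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation via the classic rank-difference formula
    (valid when there are no ties): rho = 1 - 6*sum(d^2) / (n*(n^2-1))."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def cohens_kappa(a, b):
    """Cohen's kappa for two raters labeling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    labels = sorted(set(a) | set(b))
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0

# Hypothetical example: GPT-4 rubric scores for five models vs. one
# clinician's ranking of the same models (1 = best).
gpt4_scores = [0.96, 0.93, 0.88, 0.85, 0.80]
clinician_rank = [1, 2, 4, 3, 5]

gpt4_rank = rank(gpt4_scores, descending=True)   # [1, 2, 3, 4, 5]
rho = spearman_rho(gpt4_rank, clinician_rank)    # 0.9 for this toy data
kappa = cohens_kappa(gpt4_rank, clinician_rank)  # 0.5 for this toy data
```

Spearman correlation rewards raters who order the models similarly even when individual positions differ slightly, whereas Cohen's Kappa counts only exact rank matches, which is why the two statistics can diverge on the same data.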
Results:
GPT-4 evaluated the fine-tuned ophthalmology LLMs using the scoring rubric: GPT-3.5 scored highest
(0.96), followed by LLAMA2-13B-chat (0.93) and the remaining models. Comparing GPT-4's ratings
with rankings by a medical student, an ophthalmology resident, and an ophthalmologist revealed strong
correlations with the medical student's and the ophthalmologist's assessments (0.90), but weaker
agreement with the resident's (Cohen's Kappa of 0.00). These findings show that GPT-4 can align with
human clinical judgment when ranking ophthalmology-focused LLMs, particularly against evaluations
by more experienced clinicians, highlighting its potential to support and streamline clinical assessments
in ophthalmology.
Conclusions:
GPT-4 aligns well with human clinical judgments when ranking ophthalmology LLMs, as shown by
high Spearman correlations. However, Cohen's Kappa scores indicate only moderate agreement with
the medical student and ophthalmologist (0.50) and low agreement with the resident (0.00). This
supports GPT-4's usefulness for clinical validation in ophthalmology, especially alongside experienced
clinicians, while highlighting the need for further refinement where agreement with less experienced
practitioners is low.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.