Investigative Ophthalmology & Visual Science
June 2024
Volume 65, Issue 7
Open Access
ARVO Annual Meeting Abstract  |   June 2024
EyChat: Evaluating GPT-4's Role in Enhancing Ophthalmology Learning and Clinical Validation
Author Affiliations & Notes
  • Kabilan Elangovan
    Artificial Intelligence Office, SingHealth Group, Singapore
    AI and Digital Health, Singapore Eye Research Institute, Singapore
  • Ting Fang Tan
    AI and Digital Health, Singapore Eye Research Institute, Singapore
  • Liyuan Jin
    Duke-NUS Medical School, Singapore
  • Laura Gutierrez
    AI and Digital Health, Singapore Eye Research Institute, Singapore
  • Daniel Ting
    Singapore National Eye Centre, Singapore
    Artificial Intelligence Office, SingHealth Group, Singapore
  • Footnotes
    Commercial Relationships: Kabilan Elangovan, None; Ting Fang Tan, None; Liyuan Jin, None; Laura Gutierrez, None; Daniel Ting, EyRIS, Code P (Patent)
    Support: None
Investigative Ophthalmology & Visual Science June 2024, Vol.65, 2357. doi:

Kabilan Elangovan, Ting Fang Tan, Liyuan Jin, Laura Gutierrez, Daniel Ting; EyChat: Evaluating GPT-4's Role in Enhancing Ophthalmology Learning and Clinical Validation. Invest. Ophthalmol. Vis. Sci. 2024;65(7):2357.
Abstract

Purpose: To assess the effectiveness of ophthalmology-focused large language models (LLMs) using GPT-4 as an automated evaluator with an ophthalmic-care scoring rubric. The goal is to measure how closely GPT-4's scores align with clinicians' assessments, with a view to automating clinical validation in ophthalmology AI applications. This study is a key step in establishing the clinical accuracy of LLMs and GPT-4's reliability as an evaluator in ophthalmology.
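
The abstract does not publish the grading prompt or rubric; the following is a minimal sketch of the rubric-based GPT-4 grading setup it describes, assuming the OpenAI Python SDK (v1+) and a hypothetical three-criterion rubric:

```python
# Minimal sketch of rubric-based grading with GPT-4 (OpenAI Python SDK >= 1.0).
# The rubric text and scoring scale below are hypothetical illustrations;
# the study's actual rubric is not published in the abstract.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Score the answer from 0 to 1 on each criterion:
1. Clinical accuracy of the ophthalmic content
2. Completeness relative to standard ophthalmic care
3. Safety (no harmful or misleading advice)
Reply with only the average score as a decimal number."""

def grade_answer(question: str, answer: str) -> float:
    """Ask GPT-4 to grade one LLM-generated answer against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic grading
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    return float(response.choices[0].message.content.strip())
```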

Methods: We created a dataset of 368 ophthalmology Q&A pairs covering diseases such as cataract, diabetic retinopathy, glaucoma, and refractive errors; the pairs were initially drafted with ChatGPT and then refined by ophthalmologists. We fine-tuned five large language models (LLMs) on this dataset, including versions of LLAMA2 and GPT-3.5. Their performance was assessed on 20 held-out Q&A pairs and graded by GPT-4 against the scoring rubric. Additionally, three clinicians of varying experience levels ranked the LLMs' answers. To gauge clinical alignment, we compared these rankings with GPT-4's scores using Spearman correlation and Cohen's Kappa.
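
As a minimal sketch of this alignment analysis, assuming each evaluator's ranking of the five LLMs is encoded as integers 1-5 (the rankings below are illustrative placeholders, not the study's data), the two statistics can be computed with SciPy and scikit-learn:

```python
# Comparing GPT-4's ranking of the five LLMs with one clinician's ranking.
# Example rankings are illustrative placeholders, not the study's data.
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

gpt4_ranking = [1, 2, 3, 4, 5]       # rank GPT-4 assigned to each of the 5 LLMs
clinician_ranking = [1, 3, 2, 4, 5]  # rank assigned by one clinician

rho, p_value = spearmanr(gpt4_ranking, clinician_ranking)   # rank correlation
kappa = cohen_kappa_score(gpt4_ranking, clinician_ranking)  # exact-rank agreement

print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
print(f"Cohen's kappa = {kappa:.2f}")
```

Note that Cohen's Kappa treats each rank position as a categorical label and only credits exact agreement on a model's rank, which is why it can be near zero even when the rank correlation is high.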

Results: GPT-4 evaluated the ophthalmology-focused LLMs; GPT-3.5 ranked highest (0.96), followed by LLAMA2-13B-chat (0.93) and the remaining models. Comparing GPT-4's rankings with those of a medical student, an ophthalmology resident, and an ophthalmologist revealed strong correlations (0.90) between GPT-4's ratings and the medical student's and ophthalmologist's assessments, but much weaker agreement with the resident's (Cohen's Kappa of 0.00). These findings show that GPT-4 can align with human clinical judgment when ranking ophthalmology-focused LLMs, particularly with evaluations by experienced medical professionals, and highlight its potential to support and enhance clinical assessments in ophthalmology.

Conclusions: GPT-4 aligns well with human clinical judgment in ranking ophthalmology LLMs, as shown by high Spearman correlations. However, Cohen's Kappa indicates only moderate agreement with the medical student and ophthalmologist (0.50) and no agreement with the resident (0.00). This supports GPT-4's usefulness for clinical validation in ophthalmology, especially alongside experienced clinicians, while highlighting the need for further refinement where agreement with less experienced practitioners is concerned.

This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.

 

[Figure] GPT-4 Ranking vs Clinician Ranking

[Figure] Spearman Correlation and Cohen's Kappa for each evaluator's comparison with GPT-4
