Abstract
Purpose :
To assess the precision and accuracy of Chat Generative Pre-trained Transformer-4 with vision (GPT-4 Turbo with vision, GPT-4V, OpenAI) in diagnosing common vitreoretinal diseases in a real-world ophthalmology setting.
Methods :
A retrospective chart review was conducted on patients diagnosed with the fifteen most common vitreoretinal diseases at Bascom Palmer Eye Clinic from January 2010 to March 2023. Representative patient cases were created using clinical scenarios and retinal images from the patients' initial visits. The model's accuracy in generating diagnoses and the corresponding International Classification of Diseases, 10th Revision (ICD-10) codes was assessed using open-ended questions (OEQ) and multiple-choice questions (MCQ). The cases were divided into two groups, A and B, based on the availability of adequate clinical information: Group A comprised simple cases, and Group B comprised cases that were challenging to diagnose. The accuracy of responses was independently assessed by three retina specialists.
Results :
A total of 256 eyes from 143 patients, along with their clinical histories and images, were analyzed using the GPT-4V platform. Diagnostic responses were accurate in 13.7% of cases with OEQ and 31.3% with MCQ (p < 0.001). For ICD-10 responses, accuracy was 5.5% with OEQ and 31.3% with MCQ (p < 0.001). Conditions with notably high diagnostic accuracy included posterior vitreous detachment (PVD; OEQ = 100%, MCQ = 100%), non-exudative age-related macular degeneration (NEAMD; OEQ = 55%, MCQ = 65%), and retinal detachment (RD; OEQ = 29.4%, MCQ = 64.7%). For ICD-10 responses, NEAMD (55%), central retinal vein occlusion (6.3%), and macular hole (6%) were the most accurately coded conditions with OEQ, while PVD (100%), NEAMD (65%), and RD (64.7%) topped the list with MCQ. Subgroup analyses showed no statistically significant differences between Groups A and B for diagnostic or corresponding ICD-10 responses with either OEQ or MCQ (p ≥ 0.399).
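As a rough plausibility check on the overall accuracies, the reported percentages can be converted back to approximate counts out of 256 eyes and compared. The abstract does not name the statistical test used, and because the same cases were posed in both formats, a paired test such as McNemar's would be the more natural choice; that requires per-case results, which are not reported here. The sketch below therefore assumes an unpaired, pooled two-proportion z-test purely for illustration. The function names and the back-calculated counts are assumptions, not values taken from the study.

```python
import math

N_EYES = 256  # number of eyes analyzed, as reported in the abstract

# Overall accuracies (percent) as reported in the abstract.
reported = {
    "diagnosis": {"OEQ": 13.7, "MCQ": 31.3},
    "icd10": {"OEQ": 5.5, "MCQ": 31.3},
}


def implied_count(percent, n=N_EYES):
    """Back-calculate the nearest whole number of correct responses.

    This is an approximation: the abstract reports only rounded percentages.
    """
    return round(percent / 100 * n)


def two_proportion_p(x1, x2, n):
    """Two-sided p-value from a pooled two-proportion z-test with equal n.

    Assumption: the two samples are treated as independent, which the
    study design (same cases, two question formats) does not guarantee.
    """
    pooled = (x1 + x2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (x2 - x1) / n / se
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


for task, acc in reported.items():
    oeq = implied_count(acc["OEQ"])
    mcq = implied_count(acc["MCQ"])
    p = two_proportion_p(oeq, mcq, N_EYES)
    print(f"{task}: OEQ {oeq}/{N_EYES}, MCQ {mcq}/{N_EYES}, p = {p:.2g}")
```

Under these assumptions, both comparisons fall below the 0.001 threshold, in line with the reported p-values; the exact figures should not be read as a reproduction of the authors' analysis.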
Conclusions :
The AI-based GPT-4V model holds promise for improving efficiency in clinical care and medical record-keeping. Although it performs better with standardized multiple-choice questions, its accuracy decreases in free-response scenarios, primarily due to the complexity and variability inherent in real medical cases, particularly in retina clinics. This underscores a significant limitation of the tool in providing advice on ocular health matters.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.