Abstract
Purpose:
To fine-tune a multimodal, transformer-based model for generating medical reports from slit-lamp images and to develop a subsequent question-answering (QA) system using Llama2. We term this entire process slit-lamp-GPT (Generative Pre-trained Transformer).
Methods:
Our research utilized a dataset of 25,051 slit-lamp images from 3,409 participants, paired with their corresponding physician-created medical reports. This data, divided into training, validation, and test sets, was used to fine-tune the Bootstrapping Language-Image Pre-training (BLIP) framework for report generation. The generated text reports and human-posed questions were then fed into Llama2 for interactive question answering. We evaluated performance using quantitative metrics (including BLEU, CIDEr, ROUGE-L, SPICE, accuracy, sensitivity, specificity, precision, and F1-score) and the subjective assessments of two experienced ophthalmologists, who rated the outputs on a 1-3 scale (1 indicating high quality).
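The per-condition classification metrics named above can be derived from a standard binary confusion matrix. The sketch below is illustrative only; the function name and the example counts are hypothetical and do not come from the study.

```python
# Hedged sketch: binary classification metrics of the kind reported in the
# evaluation (accuracy, sensitivity, specificity, precision, F1), computed
# from confusion-matrix counts. Counts below are illustrative, not study data.

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}

# Hypothetical counts for one condition:
m = classification_metrics(tp=45, fp=5, tn=40, fn=10)
print({k: round(v, 2) for k, v in m.items()})
```

In a multi-condition setting such as this one, these metrics would be computed once per condition (one-vs-rest) and then summarized, e.g. as the overall accuracy and F1 reported in the Results.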
Results:
A total of 50 conditions related to diseases or postoperative complications were identified through keyword matching in the initial reports. The fine-tuned slit-lamp-GPT model achieved BLEU-1 through BLEU-4 scores of 0.67, 0.66, 0.65, and 0.65, respectively, with a CIDEr score of 3.24, a ROUGE-L score of 0.61, and a SPICE score of 0.37. The most frequently identified conditions were cataract (unspecific categorization) (22.9%), age-related cataract (22.0%), and conjunctival concretion (13.1%). Disease classification showed an overall accuracy of 0.82 and an F1 score of 0.64, with high accuracies (≥0.9) for identifying intraocular lens, conjunctivitis (unspecific categorization), and chronic conjunctivitis, and high F1 scores (≥0.9) for cataract and age-related cataract. The two ophthalmologists showed a high level of agreement in their quality assessment of 100 reports, with scores of 1.36 for both completeness and correctness. Consistency was also observed in an interactive question-answering scenario involving 300 generated answers, with scores of 1.33, 1.14, and 1.15 for completeness, correctness, and possible harm, respectively.
Conclusions:
This pioneering study introduces the slit-lamp-GPT model for report generation and question answering, highlighting the potential of large language models to assist ophthalmologists and patients.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.