Abstract
Purpose:
Generative Pre-trained Transformer (GPT)-based large language models have made significant advances across many domains, but their ability to process medical images effectively remains limited. Applying ChatGPT to clinical ophthalmology could reduce physician workload and improve patient care. We aimed to develop an ocular ultrasound visual question answering (VQA) model with the help of ChatGPT to facilitate the interpretation of ultrasound reports.
Methods:
We collected information from ocular ultrasound reports written by experienced physicians and used ChatGPT to generate question-answer (QA) pairs of various question types. The QA pairs underwent quality-control filtering and were then used to fine-tune a multi-modal transformer model for VQA and report generation. VQA performance was evaluated using language-based metrics, namely the Bilingual Evaluation Understudy (BLEU), the Consensus-based Image Description Evaluation (CIDEr), the Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence (ROUGE-L), and the Semantic Propositional Image Caption Evaluation (SPICE), as well as classification metrics across question types and disease conditions. In addition, one ophthalmologist manually reviewed 100 images from the test set to assess the quality of the model's answers.
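As a minimal illustration of one of the metrics named above (this is not the authors' evaluation code, and the example sentences are hypothetical), ROUGE-L scores a generated answer against a reference answer via their longest common subsequence:

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """ROUGE-L F1 between a generated answer and a reference answer."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # fraction of candidate tokens in the LCS
    recall = lcs / len(ref)       # fraction of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: identical answers score 1.0.
print(rouge_l("the retina is detached", "the retina is detached"))  # → 1.0
```

In practice, evaluation pipelines typically use library implementations (e.g., the `rouge-score` package) rather than hand-rolled code, but the underlying LCS-based F-score is the same.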
Results:
Our study included 6,073 ocular ultrasound reports from distinct patients, covering 42 disease-related conditions. ChatGPT produced 101,417 QA pairs from the reports to fine-tune our ultrasound-VQA model, which achieved BLEU-1 through BLEU-4 scores of 0.58, 0.54, 0.52, and 0.50, a ROUGE-L of 0.57, a SPICE of 0.51, and a CIDEr of 2.54. The accuracies for binary-choice and multiple-choice QAs were 0.89 and 0.77, respectively. Manual assessment of 100 images (2,185 QA pairs) identified 6 (0.3%) QA pairs containing unrelated information, 146 (6.7%) with apparent factual errors, and 32 (1.5%) with insufficient information to provide an answer.
Conclusions:
This study demonstrated the effectiveness and potential of using ChatGPT for VQA tasks on ultrasound images. By combining generative learning and vision-language pretraining, and by accounting for the approach's limitations, we demonstrate the feasibility of large language models for medical image analysis.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.