Abstract
Purpose:
Many existing methods have proved effective for detecting specific retinal diseases, e.g., diabetic retinopathy (DR) grading. However, such methods may fail to detect multiple retinal diseases simultaneously. In this study, we encode the hierarchical relationships between retinal diseases in a deep learning model, using a vision-language model to improve multiple retinal disease recognition.
Methods:
This study involved more than one million fundus images covering 53 retinal conditions/findings, collected from private hospitals over 10 years. We adopted CLIP (https://github.com/openai/CLIP) as the backbone of our vision-language model. Two training strategies were developed for the comparison study: (1) To build the image-text paired inputs, we designed a 3-level hierarchical caption as the language-model input for each fundus image, e.g., “An image of mild non-proliferative diabetic retinopathy (low level), diabetic retinopathy (middle level), vessel (high level).” (2) We trained the baseline model with captions lacking hierarchical information, e.g., “An image of mild non-proliferative diabetic retinopathy.” The CLIP model was fine-tuned on the privately collected dataset and externally evaluated on the public ODIR dataset for recognition of 12 retinal conditions: normal, DR (mild/moderate/severe NPDR or PDR), cataract, glaucoma, hypertensive retinopathy, dry/wet AMD, pathological myopia, and other conditions.
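The two caption styles above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `HIERARCHY` mapping and function names are hypothetical, with the entry for mild NPDR taken from the abstract's example.

```python
# Hypothetical 3-level hierarchy table: condition -> (middle level, high level).
# Only the first entry is from the abstract; others would be defined per the
# study's 53 retinal conditions/findings.
HIERARCHY = {
    "mild non-proliferative diabetic retinopathy": ("diabetic retinopathy", "vessel"),
}

def hierarchical_caption(condition: str) -> str:
    """3-level caption used for the proposed hierarchical training strategy."""
    middle, high = HIERARCHY[condition]
    return (f"An image of {condition} (low level), "
            f"{middle} (middle level), {high} (high level).")

def flat_caption(condition: str) -> str:
    """Caption without hierarchical information, used for the baseline model."""
    return f"An image of {condition}."

print(hierarchical_caption("mild non-proliferative diabetic retinopathy"))
# An image of mild non-proliferative diabetic retinopathy (low level),
# diabetic retinopathy (middle level), vessel (high level).
```

Each caption would then be tokenized and paired with its fundus image for CLIP's contrastive image-text training.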
Results:
Hierarchical training significantly improved detection accuracy compared with the baseline model. AUC improvements included DR grading (93.56% to 96.32%), cataract detection (97.23% to 97.92%), glaucoma detection (89.43% to 92.45%), hypertensive retinopathy (90.51% to 91.37%), dry/wet AMD detection (95.28% to 97.19%), pathological myopia detection (94.97% to 95.82%), and other conditions (93.82% to 94.06%).
Conclusions:
Vision-language models, which leverage paired image and text information, demonstrate promising performance in multiple retinal disease recognition, and the hierarchical caption design further improves the model's effectiveness. Future research will explore extending this approach to other modalities, such as OCT images, for practical clinical translation.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.