Abstract
Purpose:
Automatic diagnosis of ocular anomalies from fundus photographs has shown great promise for scaling up screening. However, generalizing automated algorithms to new data that differ substantially from the training data, i.e. data from a different population, remains challenging. The objective of this study was to assess the generalizability of an AI algorithm across four datasets, each collected from a specific population and annotated for a pre-defined set of ocular anomalies.
Methods:
Four datasets were considered: OPHDIAT (France, diabetic population, 77,827 images), OphtaMaine (France, general population, 17,120 images), RIADD (India, general population, 3,200 images) and ODIR (China, general population, 7,000 images). To unify the ground-truth annotations, the annotations of each dataset were analyzed and converted into the ODIR class system: Normal, Diabetes, Glaucoma, Cataract, AMD, Hypertension, Myopia and Other anomalies. Each dataset was then split into a training, a validation and a testing subset. Several scenarios were studied: the AI algorithm was trained on one of the four training subsets and then tested on all four testing subsets. In addition, the AI was trained on all four training subsets combined (joint model). The AI algorithm was evaluated using the mean area under the receiver operating characteristic curve (mAUC): the AUC was computed independently for each pathology and then averaged.
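As an illustrative sketch only (not the authors' implementation), the mAUC described above can be computed as follows; the array names, shapes, and class order are assumptions for this example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical label order following the ODIR class system described above.
CLASSES = ["Normal", "Diabetes", "Glaucoma", "Cataract",
           "AMD", "Hypertension", "Myopia", "Other"]

def mean_auc(y_true: np.ndarray, y_scores: np.ndarray) -> float:
    """Compute the mAUC: one AUC per pathology, then the average.

    y_true   -- binary ground-truth matrix, shape (n_images, n_classes)
    y_scores -- predicted probabilities, shape (n_images, n_classes)
    """
    aucs = [roc_auc_score(y_true[:, i], y_scores[:, i])
            for i in range(len(CLASSES))]
    return float(np.mean(aucs))
```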
Results:
On OphtaMaine, the mAUC was 0.8799 for the AI trained on OphtaMaine, the best mAUC obtained without training on OphtaMaine was 0.8341, and the mAUC of the joint model was 0.9338. On RIADD, the AI trained on RIADD reached an mAUC of 0.9164, the best mAUC obtained without training on RIADD was 0.8680, and the mAUC of the joint model was 0.9169. On ODIR, the mAUC was 0.8803 for the AI trained on ODIR, the best mAUC obtained without training on ODIR was 0.8284, and the mAUC of the joint model was 0.8865.
Conclusions:
The performance of the AI algorithm trained on a specific dataset was good when tested on data from the same population. However, when tested on the other datasets, its performance degraded, which highlights the variability of expert interpretations among the four datasets. An AI trained jointly on the four datasets performed better on the smaller datasets.
This abstract was presented at the 2022 ARVO Annual Meeting, held in Denver, CO, May 1-4, 2022, and virtually.