Praveer Singh, J. Peter Campbell, Susan Ostmo, James Brown, Szu-Yeu Hu, Nathaphop Chaichaya, Phanthipha Wongwai, Somkiat Asawaphureekorn, Sirinya Suwannaraj, Michael Morley, Parag Shah, Narendran Venkatapathy, Robison Vernon Paul Chan, Michael F Chiang, Jayashree Kalpathy-Cramer; External validation of a deep learning algorithm for plus disease classification on a multinational ROP dataset. Invest. Ophthalmol. Vis. Sci. 2021;62(8):3266.
© ARVO (1962-2015); The Authors (2016-present)
Deep learning (DL) algorithms have been shown to perform well for classifying plus disease in retinopathy of prematurity (ROP). However, DL algorithms commonly perform worse on external datasets than on the datasets they were trained on. In this study, we demonstrate the efficacy of a DL algorithm, trained on a North American population, on two external multinational datasets.
Retcam images were obtained from India and Thailand through databases hosted by partner institutions, Aravind Eye Hospital (AEH) and Khon Kaen University (KKU), respectively. After filtering out images of inferior quality, the Indian dataset consisted of 8811 images captured from 1275 eye exams, while the Thai dataset had 1299 images from 385 eye exams, all from at-risk infants. Both the Indian and Thai datasets were additionally labelled by 2-3 North American experts, and gold standards were obtained through mutual consensus among all raters for each dataset. The performance of the i-ROP DL model, trained on Retcam images from a North American population, was evaluated on both external Retcam datasets after screening out all non-posterior-pole (non-PP) images.
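The evaluation step described above, computing AUC with and without the posterior-pole filter, can be sketched as follows. This is a minimal illustration, not the study's code: the record layout, the `view` field, the binary plus/not-plus labels, and the toy scores are all hypothetical, and the abstract does not state how PP images were identified.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the run of tied scores starting at i.
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Hypothetical records: (view, reference label, model score).
# Label 1 = plus disease, 0 = not plus (an assumed binarization).
Record = Tuple[str, int, float]
records: List[Record] = [
    ("posterior_pole", 1, 0.92),
    ("posterior_pole", 0, 0.10),
    ("posterior_pole", 1, 0.75),
    ("posterior_pole", 0, 0.30),
    ("anterior_segment", 0, 0.85),  # out-of-distribution view
    ("peripheral", 1, 0.20),        # out-of-distribution view
]


def auc_for(subset: List[Record]) -> float:
    """Compute AUC over a list of records."""
    return roc_auc([r[1] for r in subset], [r[2] for r in subset])


auc_all = auc_for(records)
auc_pp = auc_for([r for r in records if r[0] == "posterior_pole"])

print(f"AUC, all images: {auc_all:.3f}")
print(f"AUC, PP only:    {auc_pp:.3f}")
```

In this toy data the out-of-distribution views receive misleading scores, so removing them raises the AUC. The numbers are illustrative only and are unrelated to the values reported in the study.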
The two external datasets included many images that were out of distribution compared to the original i-ROP training and testing population (multiple views of the retina, anterior segment photos, samples with considerable pigmentation), and thus presented challenges for evaluation of the algorithm. The Table shows lower performance before PP-filtering (AUCs of 0.88 and 0.78 for the Indian and Thai datasets, respectively), which improved after PP-filtering (AUCs of 0.89 and 0.84) and improved further when consensus labels were used (AUCs of 0.97 and 0.95). As shown by the UMAP projections in the Figure, AEH (blue) and KKU (red), like i-ROP (yellow), have normal, pre-plus and plus feature points properly aligned in space (resulting in excellent performance), though they are segregated from i-ROP owing to demographic differences.
Applying DL algorithms to external datasets is challenging because of demographic or phenotypic differences, or differences in acquisition methodology. After PP-filtering, the i-ROP DL system showed excellent performance on the international datasets compared with the original test set. UMAP visualization further supports this finding and highlights the segregation of the external datasets owing to remaining ethnic/phenotypic differences.
This is a 2021 ARVO Annual Meeting abstract.