Abstract
Purpose :
Glaucoma causes visual impairment in many people worldwide, largely because the disease is detected too late. We aimed to build a labeled dataset for training an AI algorithm for glaucoma screening by fundus photography, to assess the accuracy of the graders, and to characterize the features of all eyes with referable glaucoma.
Methods :
Color fundus photographs of 113,897 eyes were obtained from EyePACS, California, USA, from a population screening program for diabetic retinopathy. Carefully selected graders (ophthalmologists and optometrists) graded the images. To qualify, they had to pass the EODAT¹ optic disc assessment with at least 85% accuracy and 92% specificity. Of 89 candidates, 30 passed. Each image of the EyePACS set was then graded by a randomly assigned pair of graders, with the pairing varying between images, as ‘Referable glaucoma’ (RG), ‘No referable glaucoma’ or ‘Ungradable’. In case of disagreement, a glaucoma specialist made the final grading. RG was scored if visual field damage was expected. In case of RG, graders were instructed to mark up to 10 relevant glaucomatous features.
¹ Reus N et al.; Ophthal 2010 117(4):717-23.
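The adjudication rule described above (two graders per image, with a glaucoma specialist deciding on disagreement) can be sketched as follows. This is a minimal illustration only, not the study's software; the function name `final_grade` and the label constants are hypothetical.

```python
from typing import Optional

# Hypothetical label constants mirroring the three grading outcomes.
RG = "Referable glaucoma"
NRG = "No referable glaucoma"
UNGRADABLE = "Ungradable"
LABELS = {RG, NRG, UNGRADABLE}


def final_grade(grade_a: str, grade_b: str,
                specialist_grade: Optional[str] = None) -> str:
    """Return the final label for one image.

    If both graders agree, their shared label is final. Otherwise the
    specialist's label is used (assumed here to be supplied by the caller).
    """
    for grade in (grade_a, grade_b):
        if grade not in LABELS:
            raise ValueError(f"unknown label: {grade!r}")
    if grade_a == grade_b:
        return grade_a
    if specialist_grade is None:
        raise ValueError("graders disagree; a specialist grade is required")
    if specialist_grade not in LABELS:
        raise ValueError(f"unknown label: {specialist_grade!r}")
    return specialist_grade
```

For example, `final_grade(RG, NRG, specialist_grade=RG)` returns the specialist's label, whereas `final_grade(NRG, NRG)` needs no adjudication.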
Results :
During the grading, the performance of each grader was monitored; if a grader's sensitivity dropped below 80% and/or specificity dropped below 95% (with the final grade serving as reference), they exited the study and their gradings were redone by other graders. In all, 20 graders qualified; their mean sensitivity and specificity (SD) were 85.6% (5.7%) and 96.1% (2.8%), respectively. The two graders agreed on 92.45% of the images (Gwet’s AC2, expressing the inter-rater reliability, was 0.917). Across all gradings, the sensitivity and specificity (95% CI) were 86.0% (85.2–86.7%) and 96.4% (96.3–96.5%), respectively. Of all gradable eyes (n = 111,183; 97.62%), the prevalence of RG was 4.38%. The most common features of RG were the appearance of the neuroretinal rim inferiorly and superiorly (see Figure, top, for all features and their probabilities; conditional probabilities are shown at the bottom).
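The quantities reported above can be illustrated with a short sketch. It rests on several assumptions: the abstract does not state how the 95% confidence intervals were computed, so a Wilson score interval is used here as one common choice; the agreement coefficient shown is Gwet's unweighted AC1, not the weighted AC2 reported above, because the weights used in the study are not given; and all function names and any counts passed to them are hypothetical, not study data.

```python
from math import sqrt


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int):
    """Sensitivity and specificity from confusion counts (final grade = reference)."""
    return tp / (tp + fn), tn / (tn + fp)


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (assumed method; z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def stays_in_study(sens: float, spec: float,
                   min_sens: float = 0.80, min_spec: float = 0.95) -> bool:
    """Monitoring rule: a grader exits if either metric drops below its target."""
    return sens >= min_sens and spec >= min_spec


def gwet_ac1(ratings_a, ratings_b, categories) -> float:
    """Gwet's unweighted AC1 for two raters over the given categories."""
    n = len(ratings_a)
    q = len(categories)
    # Observed agreement: fraction of items given the same label by both raters.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from the average marginal proportion of each category.
    pe = 0.0
    for k in categories:
        pi_k = (ratings_a.count(k) + ratings_b.count(k)) / (2 * n)
        pe += pi_k * (1 - pi_k)
    pe /= (q - 1)
    return (pa - pe) / (1 - pe)
```

With purely illustrative counts, `sensitivity_specificity(tp=86, fn=14, tn=96, fp=4)` gives 0.86 and 0.96, which `stays_in_study` would accept under the 80% and 95% targets.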
Conclusions :
The estimated sensitivity and specificity were above our targets of 80% and 95%, respectively, and the annotated dataset should therefore be of sufficient quality to develop AI screening solutions.
This abstract was presented at the 2022 ARVO Annual Meeting, held in Denver, CO, May 1-4, 2022, and virtually.