Abstract
Purpose :
Publicly available datasets used to train artificial intelligence (AI) models for glaucoma detection use various, often unspecified, methods to determine the ground truth for the presence or absence of glaucoma in fundus images. Accurate ground truth is essential for training valid AI models for glaucoma detection. The purpose of this study was to validate the ground truth labels for the presence or absence of glaucoma in fundus images from 20 publicly available glaucoma datasets.
Methods :
From each of 20 datasets, two images with a labeled ground truth of ‘glaucoma’ and two with a labeled ground truth of ‘no glaucoma’ were randomly sampled; 3 of the 20 datasets provided only a single label, giving a total of 74 validation instances (17 × 4 + 3 × 2). All available metadata was removed, and graders were masked to the labeled reference standard. Graders independently evaluated each image for vertical cup-to-disc ratio (VCDR), presence of peripapillary atrophy, presence of retinal nerve fiber layer defect, presence of optic disc hemorrhage, integrity of the neuroretinal rim (presence of notching), and adherence to the ISNT (inferior, superior, nasal, temporal) rule. Presence or absence of glaucoma was then determined from the evaluation of all features. Where graders disagreed, they discussed each feature and the final diagnosis. Agreement between graders, and agreement of graders with the labeled ground truth for each image, was measured by percent agreement and Cohen’s kappa coefficient.
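As an illustration of the two agreement statistics named above, the following is a minimal sketch of percent agreement and Cohen’s kappa for two raters. It is not the study’s analysis code: the function names and the toy labels (1 = ‘glaucoma’, 0 = ‘no glaucoma’) are assumptions made for this example only.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Percentage of items on which the two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(rater_a)
    if n != len(rater_b) or n == 0:
        raise ValueError("rating lists must be non-empty and of equal length")
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over all labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy labels, not data from the study.
grader_1 = [1, 1, 0, 0, 1, 0, 1, 0]
grader_2 = [1, 0, 0, 0, 1, 0, 1, 1]

print(percent_agreement(grader_1, grader_2))  # 75.0
print(cohens_kappa(grader_1, grader_2))       # 0.5
```

Kappa discounts the agreement expected by chance, which is why it can be moderate even when raw percent agreement looks high.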
Results :
Percent agreement and kappa between graders on the diagnosis of glaucoma from fundus images were 79.05% and 0.52, improving to 97.72% and 0.95, respectively, after discussion. Mean agreement of graders with the labeled reference standard and the corresponding kappa coefficient were 75.33% and 0.52, improving to 77.02% and 0.54 after discussion. Following discussion, 8 datasets had 100% agreement and 5 datasets had 50% agreement with both graders.
Conclusions :
Agreement between expert clinical graders on the presence or absence of glaucoma, based on six pre-specified clinical features, was very high after discussion, whereas agreement with the established ground truth of publicly available datasets varied greatly between datasets. Consistent, established, and clearly described protocols for evaluating the labeling of fundus images in publicly available datasets used in model development for the detection of glaucoma are necessary prior to model training and potential clinical deployment.
This abstract was presented at the 2023 ARVO Annual Meeting, held in New Orleans, LA, April 23-27, 2023.