Joy Hsu, Sonia Phene, Jieying Luo, Akinori Mitani, Naama Hammel, Jonathan Krause, Rory Sayres; Identifying and mitigating low-quality labels for deep learning in glaucoma. Invest. Ophthalmol. Vis. Sci. 2020;61(7):4537.
© ARVO (1962-2015); The Authors (2016-present)
Variability in diagnosis is well-established in many clinical tasks. This variability may be magnified in nuanced tasks like grading glaucomatous optic neuropathy (GON) from fundus images. The impact of label quality (LQ) on deep learning systems (DLS) is not well characterized. How can we identify and mitigate the effects of low-quality labels? Can mitigation strategies reduce the labelling burden for developing a DLS?
We examined label quality in data used to train a previously reported DLS for GON (Phene et al., 2019). We used one iteration of co-teaching to identify potential low-quality labels: half the training data were used to train a DLS and to predict labels for the other half. We derived a quality score (QS) for each training case based on the confidence of the DLS predicted score. The 1300 cases with the lowest QS were flagged for relabeling by a glaucoma specialist.
To determine whether QS could be used to reduce the labelling burden, we repeated the co-teaching process on a subset of 40k out of 75k patients. We trained one DLS on all 40k labels and another on only the 30k high-QS labels. We compared both DLS against a baseline model trained using the full 75k label set.
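The flagging step described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the abstract does not give the QS formula, so the code assumes a binary task and takes QS to be the probability the model assigns to the grader's label. The function names (`quality_scores`, `cross_predict`, `flag_lowest`), the `fit`/`predict` callables, and the choice to swap the two halves so that every case receives a score are all assumptions made for the example.

```python
import numpy as np

def quality_scores(pred_prob_positive, labels):
    """Assumed QS: probability the model assigns to the grader's label."""
    p = np.asarray(pred_prob_positive, dtype=float)
    y = np.asarray(labels, dtype=int)
    return np.where(y == 1, p, 1.0 - p)

def cross_predict(features, labels, fit, predict, seed=0):
    """Train on one half, predict the other half, then swap roles (assumed)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    half_a, half_b = order[: len(order) // 2], order[len(order) // 2 :]
    probs = np.empty(len(labels), dtype=float)
    for train_idx, held_idx in ((half_a, half_b), (half_b, half_a)):
        model = fit(features[train_idx], labels[train_idx])
        probs[held_idx] = predict(model, features[held_idx])
    return probs

def flag_lowest(scores, n_flag):
    """Indices of the n_flag cases with the lowest quality score."""
    return np.argsort(scores, kind="stable")[:n_flag]

# Toy stand-in for the DLS (hypothetical): a fixed-slope logistic around the mean.
toy_fit = lambda x, y: float(np.mean(x))
toy_predict = lambda center, x: 1.0 / (1.0 + np.exp(-10.0 * (x - center)))

features = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])  # toy labels, not study data
probs = cross_predict(features, labels, toy_fit, toy_predict)
flagged = flag_lowest(quality_scores(probs, labels), n_flag=2)
print("cases flagged for relabeling:", flagged)
```

In the study, the flagged set was the 1300 lowest-QS cases; here `n_flag=2` is only a toy value. Keeping the top-scoring cases instead of flagging the bottom ones would give the high-QS training subset.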
When relabeling images with low QS, glaucoma specialists agreed with the model, and not the initial label, 85% of the time. This suggests that most low-quality labels may result from grader lapses.
Excluding low-QS labels from training produced a DLS that performed comparably to our baseline model while requiring fewer total labels. A DLS trained on 30k high-QS labels derived from co-teaching a total set of 40k was non-inferior to our baseline DLS (AUC 0.927 vs. 0.933 on our internal validation set, p < 0.01, DeLong test with a non-inferiority margin of 2%). By contrast, a DLS trained on the 40k subset, including low-QS labels, had inferior performance (AUC 0.907, p > 0.7) compared to our baseline DLS.
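As a rough arithmetic check on the reported AUCs, the gaps to the baseline can be compared against the margin. This sketch assumes the 2% margin is an absolute AUC difference of 0.02, which the abstract does not specify, and it looks at point estimates only. It is not the DeLong test, which also accounts for the variance of the AUC estimates. The variable and function names are illustrative.

```python
# Assumption: the reported 2% margin is an absolute AUC difference.
NON_INFERIORITY_MARGIN = 0.02

baseline_auc = 0.933  # DLS trained on the full 75k label set
candidate_aucs = {
    "30k high-QS labels": 0.927,
    "40k labels including low-QS": 0.907,
}

def auc_gap(baseline, candidate):
    """Point-estimate AUC shortfall of a candidate relative to the baseline."""
    return baseline - candidate

for name, auc in candidate_aucs.items():
    gap = auc_gap(baseline_auc, auc)
    within = gap < NON_INFERIORITY_MARGIN
    print(f"{name}: gap = {gap:.3f}, within margin = {within}")
```

Under that assumption, the 30k high-QS model falls short of the baseline by 0.006, inside the margin, while the 40k model falls short by 0.026, outside it. This matches the direction of the reported test results.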
Low label quality may substantially affect DLS performance, and may increase the cost of developing DLS by requiring more labels. Within the context of GON grading, low-quality labels largely derive from grader errors. Co-teaching can be an effective way to identify suspected low-quality labels; omitting these labels during training may be an effective mitigation strategy for low label quality.
This is a 2020 ARVO Annual Meeting abstract.
Illustration of the co-teaching strategy used in this study.