June 2020
Volume 61, Issue 7
Open Access
ARVO Annual Meeting Abstract  |   June 2020
Identifying and mitigating low-quality labels for deep learning in glaucoma
Author Affiliations & Notes
  • Joy Hsu
    Google, California, United States
  • Sonia Phene
    Google, California, United States
  • Jieying Luo
    Google, California, United States
  • Akinori Mitani
    Google, California, United States
  • Naama Hammel
    Google, California, United States
  • Jonathan Krause
    Google, California, United States
  • Rory Sayres
    Google, California, United States
  • Footnotes
    Commercial Relationships   Joy Hsu, Google (E); Sonia Phene, Google (E); Jieying Luo, Google (E); Akinori Mitani, Google (E); Naama Hammel, Google (E); Jonathan Krause, Google (E); Rory Sayres, Google (E)
  • Footnotes
    Support  None
Investigative Ophthalmology & Visual Science June 2020, Vol.61, 4537. doi:

      Joy Hsu, Sonia Phene, Jieying Luo, Akinori Mitani, Naama Hammel, Jonathan Krause, Rory Sayres; Identifying and mitigating low-quality labels for deep learning in glaucoma. Invest. Ophthalmol. Vis. Sci. 2020;61(7):4537.


      © ARVO (1962-2015); The Authors (2016-present)


Purpose : Variability in diagnosis is well-established in many clinical tasks. This variability may be magnified in nuanced tasks like grading glaucomatous optic neuropathy (GON) from fundus images. The impact of label quality (LQ) on deep learning systems (DLS) is not well characterized.

How can we identify low-quality labels and mitigate their effects? Can mitigation strategies reduce the labeling burden for developing a DLS?

Methods : We examined label quality in the data used to train a previously reported DLS for GON (Phene et al., 2019). We used one iteration of co-teaching to identify potential low-quality labels: half of the training data were used to train a DLS, which then predicted labels for the other half. We derived a quality score (QS) for each training case based on the confidence of the DLS-predicted score. The 1300 cases with the lowest QS were flagged for relabeling by a glaucoma specialist.
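
The split-and-predict step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study. It rests on several assumptions: a small logistic regression stands in for the DLS; both halves are scored in turn so that every case receives a QS; and QS is taken to be the probability the held-out model assigns to the label a case already carries (the abstract says only that QS is based on prediction confidence). The names `fit_logistic`, `predict_proba`, `quality_scores`, and `flag_lowest` are hypothetical.

```python
import numpy as np


def fit_logistic(X, y, lr=0.1, steps=500):
    """Fit a tiny logistic regression by gradient descent (stand-in for the DLS)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict_proba(w, X):
    """Return the predicted probability of the positive class."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))


def quality_scores(X, y, seed=0):
    """Score each case with a model trained on the half that excludes it.

    Assumed QS definition: probability the held-out model gives to the
    case's existing label (low QS = model disagrees with the label).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half_a, half_b = idx[: len(y) // 2], idx[len(y) // 2:]
    qs = np.empty(len(y))
    for train, held_out in ((half_a, half_b), (half_b, half_a)):
        w = fit_logistic(X[train], y[train])
        p = predict_proba(w, X[held_out])
        qs[held_out] = np.where(y[held_out] == 1, p, 1.0 - p)
    return qs


def flag_lowest(qs, n):
    """Return the indices of the n cases with the lowest quality score."""
    return np.argsort(qs)[:n]
```

With feature matrix `X` and binary labels `y`, `flag_lowest(quality_scores(X, y), 1300)` would return the indices of the 1300 lowest-QS cases, mirroring the relabeling queue described above.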

To determine whether QS could be used to reduce the labeling burden, we repeated the co-teaching process on a 40k subset of the 75k patients. We trained one DLS on all 40k labels and another on only the 30k labels with high QS. We compared both DLS against a baseline model trained on the full 75k label set.

Results : When relabeling images with low QS, glaucoma specialists agreed with the model, and not the initial label, 85% of the time. This suggests that most low-quality labels may result from grader lapses.

Excluding low-QS labels from training produced a DLS that performed comparably to our baseline model while requiring fewer total labels. A DLS trained on the 30k high-QS labels, selected by co-teaching on a total set of 40k, was non-inferior to our baseline DLS (AUC 0.927 vs. 0.933 on our internal validation set; p < 0.01, DeLong test with a non-inferiority margin of 2%). By contrast, a DLS trained on the full 40k subset, including low-QS labels, had inferior performance compared to our baseline DLS (AUC 0.907, p > 0.7).
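
As a rough check on these figures, the AUC point estimates can be compared directly with the margin. This assumes the 2% margin is an absolute AUC difference of 0.02; the abstract does not say whether the margin is absolute or relative. The symbols $\Delta_{30k}$ and $\Delta_{40k}$ are introduced here for illustration only.

```latex
\begin{align*}
\Delta_{30k} &= \mathrm{AUC}_{\text{baseline}} - \mathrm{AUC}_{30k} = 0.933 - 0.927 = 0.006 < 0.02 \\
\Delta_{40k} &= \mathrm{AUC}_{\text{baseline}} - \mathrm{AUC}_{40k} = 0.933 - 0.907 = 0.026 > 0.02
\end{align*}
```

Under that reading, the 30k high-QS model falls within the margin and the 40k model falls outside it. The DeLong test goes further than this comparison by accounting for the sampling variability of the two correlated AUC estimates, so the reported p-values cannot be recovered from the point estimates alone.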

Conclusions : Low label quality may substantially affect DLS performance, and may increase the cost of developing DLS by requiring more labels. Within the context of GON grading, low-quality labels largely derive from grader errors. Co-teaching can be an effective way to identify suspected low-quality labels; omitting these labels during training may be an effective mitigation strategy for low label quality.

This is a 2020 ARVO Annual Meeting abstract.


Illustration of the co-teaching strategy used in this study.


