Adam Hanif, Ilkay Yildiz, Peng Tian, Beyza Kalkanli, Deniz Erdogmus, Stratis Ioannidis, Jennifer Dy, Jayashree Kalpathy-Cramer, Susan Ostmo, Karyn Jonas, R.V. Paul Chan, Michael F. Chiang, J. Peter Campbell; Improved training efficiency for deep learning models using disease severity comparison labels. Invest. Ophthalmol. Vis. Sci. 2021;62(8):2108.
Neural network performance relies on large, high-quality training sets. In medical image recognition tasks, small datasets and high inter-labeler variance frequently limit models’ diagnostic accuracy. In this study, we compare the efficiency of training neural networks to predict disease severity using “comparison” labels versus the traditional method of using diagnostic “class” labels from a retinopathy of prematurity retinal image dataset.
Each of 100 fundus images was assigned a “class” label indicating plus disease severity (“No Plus,” “Pre-plus,” or “Plus”) by majority vote of 3 experts. Additionally, all pairs of images within the set were assigned “comparison” labels reflecting relative disease severity, obtained from 5 experts (4950 labels total). Deep learning models were first trained with “class” labels from up to 60 randomly sampled images and validated on a set of 20 images with “class” labels; this process was then repeated using “comparison” labels. All models were evaluated on a test set of 5561 pre-labeled fundus images in two binary classification experiments: “Normal vs. Abnormal” and “Plus vs. Non-plus.” For each model, predictive performance was measured by the area under the receiver operating characteristic curve (AUC).
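The abstract does not specify the loss used for comparison labels; a common formulation for pairwise severity comparisons is a Bradley-Terry-style logistic loss, in which the network outputs a scalar severity score per image and a pair's label supervises the score difference. The sketch below (a minimal NumPy illustration, not the authors' implementation; `pairwise_logistic_loss` is a hypothetical helper) also verifies the pair count: 100 images yield 100 × 99 / 2 = 4950 unordered pairs, matching the 4950 comparison labels reported.

```python
import numpy as np
from itertools import combinations

# All unordered pairs from 100 images: 100 * 99 / 2 = 4950 comparisons,
# matching the 4950 comparison labels in the study.
n_pairs = sum(1 for _ in combinations(range(100), 2))  # 4950

def pairwise_logistic_loss(s_i, s_j, y):
    """Loss for one comparison label y in {0, 1}, where y = 1 means
    image i was judged more severe than image j.

    The model outputs a scalar severity score per image; the probability
    that image i outranks image j is modeled as sigmoid(s_i - s_j)
    (a Bradley-Terry-style assumption, not stated in the abstract)."""
    p = 1.0 / (1.0 + np.exp(-(s_i - s_j)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Under this formulation every labeled pair contributes a training signal, which is one intuition for why a fixed image budget yields many more usable labels with comparisons than with one class label per image.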
For a given number of images, models trained on “comparison” labels consistently outperformed those trained on “class” labels. For the same number of labels, the performance of the two label types was similar, but models trained on “class” labels exhibited confidence intervals up to 0.2% wider in “Normal vs. Abnormal” experiments and up to 0.4% wider in “Plus vs. Non-plus” experiments (Figure 1).
"Comparison" labels are more informative per image than "class" labels. Further, the inherent subjectivity of "class" labels generates higher variability in model performance. This offers a solution for training highly accurate image classification models with fewer data.
This is a 2021 ARVO Annual Meeting abstract.
AUC of models trained with either “comparison” or “class” labels from sets of up to 60 corresponding images (A, B) or individual labels (C, D). Models' accuracy in disease severity prediction was assessed through binary image classification experiments: Normal vs. Abnormal (A, C) and Plus vs. Non-plus (B, D). Confidence intervals on reported metrics are indicated by the shaded region around the mean curve.