Abstract
Purpose:
Deep learning (DL) techniques can be used to detect abnormalities in OCT B-scans. The performance of such an algorithm depends on the quality of both the data and its labels. The data may come either from a clinical study or from busy eye clinics. While image quality, disease prevalence, age of subjects, etc. can be well controlled within the scope of a clinical study, the same may not hold for data collected in eye clinics. The involvement of multiple labelers may introduce labelling inconsistencies, because each expert exercises their own clinical judgment, which differs depending on how and where (primary, secondary, or tertiary clinics) they practice. In this abstract, we discuss the effects of these aspects on classification performance.
Methods:
To assess the effect of data source, we gathered macular OCT cube data during clinical studies and from eye clinics. Next, to measure the effect of labelling variability, the data were labelled at the B-scan level by five labelers. Of the five, two (labelers X and Y in Fig. 2) had similar expertise and practiced in the same hospital, while the remaining three (labelers A, B, and C) were from three different eye clinics. The data from each labeler were split into training and test sets. An Inception_V1 model was trained on each labeler's training set, and its performance was evaluated on all test sets. Figs. 1(a) and 2(a) show the number of samples used for training and evaluation.
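The per-labeler training and cross-evaluation procedure can be outlined as follows. This is a minimal sketch, not the authors' code: it assumes torchvision's GoogLeNet as the Inception_V1 backbone, a binary normal/abnormal label per B-scan, AUC as the evaluation metric, and pre-built DataLoaders keyed by labeler ID; all hyperparameters are illustrative.

```python
# Sketch: train one Inception_V1 (GoogLeNet) model per labeler, then evaluate
# every model on every labeler's test set. Loaders, labels, and hyperparameters
# are assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.metrics import roc_auc_score

def build_model(num_classes: int = 2) -> nn.Module:
    # torchvision's GoogLeNet is the Inception_V1 architecture.
    model = models.googlenet(weights=None, aux_logits=False)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def train(model, loader, epochs=10, lr=1e-4, device="cuda"):
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:  # x: batch of B-scans, y: normal/abnormal labels
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

@torch.no_grad()
def evaluate_auc(model, loader, device="cuda"):
    model.to(device).eval()
    scores, labels = [], []
    for x, y in loader:
        p = torch.softmax(model(x.to(device)), dim=1)[:, 1]  # P(abnormal)
        scores.extend(p.cpu().tolist())
        labels.extend(y.tolist())
    return roc_auc_score(labels, scores)

def cross_evaluate(train_loaders, test_loaders):
    # train_loaders / test_loaders: dicts mapping labeler IDs
    # ("X", "Y", "A", "B", "C") to DataLoaders over that labeler's labels.
    auc = {}
    for src, tr_loader in train_loaders.items():
        model = train(build_model(), tr_loader)
        for tgt, te_loader in test_loaders.items():
            auc[(src, tgt)] = evaluate_auc(model, te_loader)
    return auc  # AUC for each (training labeler, test labeler) pair
```

A matrix of AUC values over all (training labeler, test labeler) pairs, as produced by a loop like this, is what underlies the cross-labeler comparisons reported in the Results.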
Results:
Fig. 1(b) shows that evaluating a model on data from clinical trials does not always indicate good generalizability and may overestimate model accuracy. Training on 'uncontrolled' data sources leads to better overall performance in a typical clinical setting, even if such a model underperforms in the clinical trial setting.
From Fig. 2(b), we observe that models perform well on data labelled by experts with a similar background, whereas accuracy for abnormality prediction drops significantly, as measured by AUC, when the test labels come from labelers with different backgrounds.
Conclusions:
We conclude that prediction models are not easily transferable across labelers from different backgrounds. Furthermore, the accuracy of a model measured on clinical trial data may not transfer to busy clinical environments.
This is a 2021 ARVO Annual Meeting abstract.