Abstract
Purpose:
AI-based measurements are becoming more common in clinical routine and clinical trial analysis. Poor image quality or obfuscation of features by OCT imaging artifacts can negatively impact the reliability of automated biomarker measurements. This study evaluates the efficacy of a novel deep learning (DL) algorithm in detecting insufficient image quality in OCT scans used to measure retinal thickness. Our aim is to validate this DL model as a reliable tool for identifying scans with poor image quality, thereby enhancing the diagnostic accuracy and utility of OCT imaging.
Methods:
The model (M) is designed to annotate retinal layers. It provides a segmentation of the layer lines as well as the uncertainty of the prediction along each A-scan. A B-scan is deemed ungradable when a composite of these uncertainties (90th percentile across 3 layer lines) exceeds a threshold. A real-world dataset was gathered from the macular clinic of the Vienna General Hospital between 2007 and 2018. From a registry of 128,000 scans acquired with a Cirrus (Carl Zeiss Meditec) scanner, 504 were randomly sampled with stratification on acquisition year, yielding images of varying quality. Two experienced readers (R1, R2) independently assessed whether 816 B-scans were gradable for measuring retinal thickness and, if a scan was deemed ungradable, selected the reason(s) why. The initial threshold was determined on a 504 B-scan training set and validated on a 200 B-scan test set. Finally, 112 B-scans flagged as poor quality were graded in a confirmation study.
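The abstract does not specify how the three per-line uncertainty percentiles are combined into a single composite; the following minimal sketch assumes a per-A-scan uncertainty array of shape (3 layer lines × number of A-scans) and takes the maximum of the three 90th-percentile values, with a purely illustrative threshold (the study tunes its threshold on the training set).

```python
import numpy as np

def is_gradable(uncertainty: np.ndarray, threshold: float = 0.5) -> bool:
    """Decide B-scan gradability from per-A-scan layer uncertainties.

    uncertainty: array of shape (3, n_ascans) holding the model's prediction
        uncertainty for 3 retinal layer lines along each A-scan.
    threshold: illustrative cut-off; in the study it is determined on a
        504 B-scan training set.
    """
    # 90th percentile of uncertainty along the A-scan axis, one value per line
    p90 = np.percentile(uncertainty, 90, axis=1)   # shape (3,)

    # Composite across the 3 lines (assumed here to be the worst, i.e. maximum)
    composite = p90.max()

    # The B-scan is ungradable when the composite exceeds the threshold
    return composite <= threshold
```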
Results:
The model achieved a mean specificity of 0.99 on the test set and an agreement of 0.69 in the confirmation study. Of 191 test scans marked gradable by M, 173 (91%) were in accordance with R1+R2. Inter-reader agreement on gradability was high for the training and test sets (0.91/0.92) but dropped in the confirmation study (0.80). Similarly, the agreement of M with R1+R2 was high for the training and test sets (0.89/0.90) and decreased in the confirmation study (0.59). Twenty-one test scans with disagreement between M, R1, and R2 were qualitatively assessed: 17 were attributed to R1/R2 ambivalence and 4 were identified as false positives.
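The abstract does not state whether the reported agreement values are raw percent agreement or a chance-corrected statistic; the sketch below shows raw agreement and a specificity computed with "ungradable" treated as the positive class, assuming boolean label arrays (one entry per B-scan).

```python
import numpy as np

def specificity(model_ungradable: np.ndarray, reader_ungradable: np.ndarray) -> float:
    """Fraction of reader-gradable B-scans that the model also marks gradable
    (true negatives / (true negatives + false positives), with 'ungradable' as positive)."""
    tn = np.sum(~model_ungradable & ~reader_ungradable)
    fp = np.sum(model_ungradable & ~reader_ungradable)
    return float(tn) / float(tn + fp)

def percent_agreement(labels_a: np.ndarray, labels_b: np.ndarray) -> float:
    """Fraction of B-scans on which two raters assign the same label."""
    return float(np.mean(labels_a == labels_b))
```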
Conclusions:
A DL model was used to identify scans of insufficient quality for AI-based analysis based on the uncertainty of its layer predictions. Our findings show that such models can efficiently identify problematic cases in large datasets and are suitable as a robust tool for OCT image quality assessment.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.