Abstract
Purpose:
To identify clinical and subclinical features that adversely impact the accuracy of retinal fluid measurements made by expert annotators, and to explore their influence on the ability of AI tools (RetInSight Fluid Monitor, RET-FM) to provide accurate and consistent results.
Methods:
Two expert reading groups each annotated 48 real-world Triton OCT scans (3,072 B-scans, 64 per volume) from nAMD patients. Annotations were created according to the same protocol. Each B-scan was assigned labels for descriptive clinical and subclinical features (e.g. vessel shadows or thickening of the outer nuclear layer). Inter-reader variability, measured by the Dice coefficient, was analyzed within label-derived subgroups. The Kruskal-Wallis test was used to identify features associated with significantly higher inter-reader variability. The efficacy of RET-FM was assessed within feature subgroups. Model predictions were compared with annotations from both reader groups, and with the regions of agreement between groups.
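As an illustration of the variability analysis, the Dice and Kruskal-Wallis steps could be implemented along the following lines (a minimal Python sketch using NumPy/SciPy; the mask arrays, feature labels, and function names are assumptions for illustration, not the study's actual pipeline):

import numpy as np
from scipy.stats import kruskal

def dice(mask_a, mask_b):
    # Dice coefficient between two binary fluid masks for one B-scan.
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # neither reading group marked fluid: treat as full agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

def variability_by_feature(masks_group1, masks_group2, feature_labels):
    # Per-B-scan Dice between the two reading groups, grouped by feature label.
    scores = [dice(a, b) for a, b in zip(masks_group1, masks_group2)]
    subgroups = {}
    for score, label in zip(scores, feature_labels):
        subgroups.setdefault(label, []).append(score)
    # Kruskal-Wallis H-test across feature subgroups: a low p-value indicates that
    # at least one feature is associated with a different Dice distribution.
    h_stat, p_value = kruskal(*subgroups.values())
    return subgroups, h_stat, p_value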
Results:
Annotations were characterized and compared across feature subgroups. Annotation styles differed between the annotator groups, with one group marking 32% more pixels on average; this increase was consistent across feature subgroups. Biomarker measurement reliability was significantly impacted by 18 of 30 assessed features. Vessel shadows, pseudo-drusen and thickening of the outer nuclear layer were associated with the largest variability. RET-FM demonstrated high sensitivity (IRF: 0.90, SRF: 0.97, PED: 0.93) on the pixel-level consensus regions between reading groups. Qualitative analysis of regions with low inter-reader agreement suggests that the disagreement is likely due to reduced biomarker visibility and uncertainty in feature presentation. AI-based segmentation made consistent annotation decisions in these scenarios.
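For reference, pixel-level sensitivity on the consensus region between reading groups could be computed as in this sketch (Python/NumPy; input arrays and names are assumptions for illustration, not the evaluation code used in the study):

import numpy as np

def consensus_sensitivity(prediction, reader_group1, reader_group2):
    # Consensus region: pixels that both reading groups annotated as the biomarker
    # (computed separately for IRF, SRF and PED).
    consensus = np.logical_and(reader_group1.astype(bool), reader_group2.astype(bool))
    if consensus.sum() == 0:
        return float("nan")  # no consensus pixels for this biomarker / B-scan
    # Sensitivity: fraction of consensus pixels that the model also labelled positive.
    true_positives = np.logical_and(prediction.astype(bool), consensus).sum()
    return true_positives / consensus.sum()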
Conclusions:
Annotation quality depends on the annotation protocol, the annotation style of readers, and reader experience. Even experienced readers disagree on how to annotate in the presence of confounding features. Our study identified 18 subclinical features that impact the accuracy of biomarker measurement. AI tools such as RET-FM can provide critical guidance in such challenging cases by capturing key regions and handling confounding features in a consistent manner.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.