Abstract
Purpose :
To examine deep learning-based segmentation of geographic atrophy (GA) lesions and the reliability of derived features.
Methods :
Nine hundred and forty image pairs were obtained from 194 patients, with each pair comprising a fundus autofluorescence (FAF) image and a near-infrared (NIR) image from 1 eye (Proxima B, NCT02399072). Lesions were annotated on the FAF images by a grader, and the data were split at the patient level into training (n=155 patients) and validation (n=39 patients) sets. A test set comprising 90 FAF-NIR pairs from 90 patients (Proxima A, NCT02479386) was annotated by 2 graders (G1 and G2). Two multimodal deep learning networks (UNet and YNet) were trained on the training set and tuned on the validation set. The final network was applied to the test set. For each segmentation mask, the lesion area, perimeter, circularity, Feret diameters, and number of lesions were extracted. As a numerical proxy for the FAF pattern, the excess rim intensity (ERI) was also extracted, defined as the mean FAF intensity in a 0.5-mm rim around the lesion minus the mean FAF intensity in a 0.5- to 1-mm rim around the lesion. For all measures except the number of lesions, the metric was computed over the whole segmented area without separating it into individual components.
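The ERI definition and two of the simpler features (area and number of lesions) can be illustrated with a minimal Python sketch. This is not the study's implementation: the function names, the `mm_per_pixel` scale parameter, the use of a Euclidean distance transform to define the rims, and the 4-connectivity used for counting lesions are all assumptions made for illustration. The abstract does not state how the rims or connected components were computed.

```python
import numpy as np
from scipy import ndimage


def excess_rim_intensity(faf, mask, mm_per_pixel, inner_mm=0.5, outer_mm=1.0):
    """Mean FAF in the 0 to inner_mm rim minus mean FAF in the inner_mm to outer_mm rim.

    Assumption: rims are defined by Euclidean distance from the whole
    segmented area, and exclude the lesion pixels themselves.
    """
    faf = np.asarray(faf, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return float("nan")  # no lesion, so no rim to measure

    # Distance (in mm) from each non-lesion pixel to the nearest lesion pixel.
    # Lesion pixels get distance 0.
    dist_mm = ndimage.distance_transform_edt(~mask, sampling=mm_per_pixel)

    inner_rim = (dist_mm > 0) & (dist_mm <= inner_mm)
    outer_rim = (dist_mm > inner_mm) & (dist_mm <= outer_mm)
    if not inner_rim.any() or not outer_rim.any():
        return float("nan")  # a rim fell outside the image
    return float(faf[inner_rim].mean() - faf[outer_rim].mean())


def lesion_area_and_count(mask, mm_per_pixel):
    """Total segmented area in mm^2 and number of connected components."""
    mask = np.asarray(mask, dtype=bool)
    area_mm2 = float(mask.sum()) * mm_per_pixel ** 2
    # ndimage.label uses 4-connectivity for 2D input by default (an assumption here).
    _, n_lesions = ndimage.label(mask)
    return area_mm2, int(n_lesions)
```

In this sketch the area and ERI are computed over the whole mask, in line with the Methods, while only the lesion count depends on how the mask is split into components. That dependence on connectivity and on small isolated regions is one plausible reason a count can be more fragile than an area.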
Results :
The average Dice score between the network and G1 on the test set was 0.92. The Pearson correlation coefficients (r) between the YNet and G1 for area, perimeter, circularity, major Feret diameter, minor Feret diameter, ERI, and number of lesions were 0.98, 0.93, 0.86, 0.87, 0.93, 1.00, and 0.46, respectively. Analogous statistics for the network versus G2 and for G1 versus G2 are given in Table 1.
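For readers unfamiliar with the two agreement measures reported here, the following is a minimal NumPy sketch of a Dice score between two binary masks and a Pearson correlation between paired feature values. The function names are hypothetical, and the convention of returning 1.0 when both masks are empty is an assumption. The abstract does not describe how such cases were handled.

```python
import numpy as np


def dice_score(mask_a, mask_b):
    """Dice overlap 2|A and B| / (|A| + |B|) between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(a, b).sum() / denom)


def pearson_r(x, y):
    """Pearson correlation between paired values, e.g. network vs. grader areas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])
```

A per-image Dice score would be averaged over the test set, whereas each Pearson r is computed once per feature across all test images, pairing the value derived from one segmentation source with the value derived from the other.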
Conclusions :
Networks trained to segment GA lesions could produce accurate segmentations. When the segmentations were used to obtain the values of area and ERI, the agreement between the networks and human graders was similar to the agreement between two graders. Inferred values of perimeter, circularity, and Feret diameters were less similar, and often varied between models despite similar Dice scores. The inferred number of lesions matched human grading poorly. The variable accuracy of the examined features could be an important factor for their use in predictive models of GA growth.
This is a 2021 ARVO Annual Meeting abstract.