July 2024
Volume 65, Issue 8
Open Access
Retina  |   July 2024
Quantifying Geographic Atrophy in Age-Related Macular Degeneration: A Comparative Analysis Across 12 Deep Learning Models
Author Affiliations & Notes
  • Apoorva Safai
    A-EYE Research Unit, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
    Depts of Radiology and Biomedical Engineering, University of Wisconsin, Madison, Wisconsin, United States
  • Colin Froines
    Wisconsin Reading Center, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
  • Robert Slater
    A-EYE Research Unit, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
  • Rachel E. Linderman
    A-EYE Research Unit, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
    Wisconsin Reading Center, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
  • Jacob Bogost
    A-EYE Research Unit, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
  • Caleb Pacheco
    A-EYE Research Unit, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
  • Rickie Voland
    Wisconsin Reading Center, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
  • Jeong Pak
    Wisconsin Reading Center, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
  • Pallavi Tiwari
    Depts of Radiology and Biomedical Engineering, University of Wisconsin, Madison, Wisconsin, United States
  • Roomasa Channa
    A-EYE Research Unit, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
  • Amitha Domalpally
    A-EYE Research Unit, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
    Wisconsin Reading Center, Dept of Ophthalmology and Visual Sciences, University of Wisconsin, Madison, Wisconsin, United States
  • Correspondence: Amitha Domalpally, 301 S Westfield Rd, Suite 200, Madison, WI 53717, USA; domalpally@wisc.edu
Investigative Ophthalmology & Visual Science July 2024, Vol.65, 42. doi:https://doi.org/10.1167/iovs.65.8.42
Abstract

Purpose: Artificial intelligence (AI) algorithms have shown impressive performance in segmenting geographic atrophy (GA) from fundus autofluorescence (FAF) images. However, the choice of AI architecture is an important variable in model development. Here, we explore 12 distinct AI architecture combinations to determine the most effective approach for GA segmentation.

Methods: We investigated various AI architectures, each with a distinct combination of encoder and decoder. The architectures included three decoders—FPN (Feature Pyramid Network), UNet, and PSPNet (Pyramid Scene Parsing Network)—which serve as the foundational framework for the segmentation task. The encoders—EfficientNet, ResNet (Residual Network), VGG (Visual Geometry Group), and Mix Vision Transformer (mViT)—extract the latent features needed for accurate GA segmentation. Performance was measured by comparing GA areas between human graders and AI predictions and by the Dice coefficient (DC).

Results: The training dataset included 601 FAF images from the AREDS2 study, and validation included 156 FAF images from the GlaxoSmithKline study. The mean difference between grader-measured and AI-predicted areas ranged from −0.08 mm2 (95% CI = −1.35, 1.19) to 0.73 mm2 (95% CI = −5.75, 4.29), and the DC ranged between 0.884 and 0.993. The best-performing models were the UNet and FPN frameworks with mViT, and the lowest-performing models used the PSPNet framework.

Conclusions: The choice of AI architecture affects GA segmentation performance. Vision transformers with FPN and UNet frameworks are better suited to this task than convolutional neural network– and PSPNet-based models. The choice of AI architecture should be tailored to the specific goals of each project.

In recent years, artificial intelligence (AI) has proven highly effective in medical image analysis. AI model development is typically a multistep process, starting with selection of an appropriate AI architecture, followed by curation of training and testing datasets and fine-tuning of hyperparameters. The model architecture serves as the foundation, determining the model's ability to capture fine details from the images. A variety of architectures are available, each with its own strengths and characteristics.1 A segmentation AI architecture consists of a framework within which encoder-decoder combinations are configured. The encoder extracts relevant information from images, and the decoder generates meaningful output from the extracted features. The choice of encoder and decoder and their organization within a framework forms the basis of the model architecture. The method of decoding has come to lend its name to the entire model or architecture, whereas the encoder has become a swappable subcomponent. Other variables, called hyperparameters, including the learning rate, batch size, and number of layers, also need tuning to balance model performance against computational cost. This process ensures that the model can generalize effectively to unseen data, a critical requirement for real-world applications.
Age-related macular degeneration (AMD) is a leading cause of vision impairment worldwide, with geographic atrophy (GA) being a hallmark of its advanced stage.2 GA is characterized by the progressive loss of retinal pigment epithelium (RPE) cells and photoreceptors, resulting in irreversible vision impairment.3 Accurate measurement of GA is crucial for disease assessment and monitoring, as well as for developing and evaluating potential treatments. Fundus autofluorescence (FAF) imaging using blue-light autofluorescence has emerged as the preferred modality for imaging GA because it provides high-contrast grayscale images in which atrophy presents as dark (hypoautofluorescent) lesions with distinct boundaries.4 Enlargement of GA area on FAF images is an important outcome for clinical trials.5 Traditional manual measurements are time-consuming and subject to inter- and intraobserver variability.6 Additionally, these measurements are typically performed in reading centers using methods that do not fit within a busy clinic workflow.
Segmentation of GA using deep learning models has been generally successful.7–11 However, owing to the variable phenotypes and manifestations of GA, accurately identifying the hypoautofluorescent boundary and generating precise segmentations across datasets is challenging.12 Selecting an AI architecture that captures the heterogeneous patterns of GA is critical for attaining robust segmentations and better model performance. A comprehensive evaluation of popular segmentation frameworks could serve as a guideline for selecting an AI architecture that yields high performance for the GA segmentation task. In this study, we explored three segmentation architectures combined with four widely used encoders, yielding 12 distinct AI models, to identify the best framework for segmentation of GA on FAF images.
Methods
Training Dataset: Image Acquisition
Age-Related Eye Disease Study 2 (AREDS2)13 was a multicenter randomized clinical trial designed to study the effects of oral supplements on progression to advanced AMD. The study was conducted under institutional review board approval at each site and written informed consent was obtained from all study participants. The research was conducted under the Declaration of Helsinki and complied with the Health Insurance Portability and Accountability Act. Participants at high risk of developing late AMD because of either bilateral large drusen or late AMD in one eye and large drusen in the fellow eye were enrolled. Development of either central GA or neovascular AMD was the primary AREDS2 study outcome. 
An autofluorescence ancillary study was initiated to obtain autofluorescence images from a subset of participating clinics (36 of 90 sites) based on availability of imaging equipment.14 Sites were permitted to join the ancillary study at any time after imaging equipment became available during the study period between the first AREDS2 visit and five-year follow-up visit (2007–2013). FAF images were obtained from Heidelberg Retinal Angiograph (HRA, Heidelberg, Germany) by certified photographers. A single image was acquired at 30° centered on the macula, captured in high-speed mode (768 × 768 pixels) using the automated real time mean function set at 14. Images were exported as tiff format to the Wisconsin Reading Center (formerly Fundus Photograph Reading Center) for evaluation by certified graders. 
For this project, FAF images with GA were included from AREDS2 study visits at years 4, 5, and 6, because that was the time frame by which most sites with FAF capability had joined the ancillary study, maximizing the diversity of sites. There were 1501 FAF images corresponding to these visits. A total of 601 FAF images from 362 eyes (271 participants) with GA were used from the AREDS2 study for training the segmentation models. Of these, 200 (55.2%) eyes had only one visit and 162 (44.8%) eyes had two or more visits; cross-validation therefore required splitting at the subject level, as sketched below.
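Below is a minimal sketch of a subject-level split using scikit-learn's GroupKFold; the manifest structure and column names are illustrative, not the study's actual data pipeline, and the fold count is reduced to fit the toy data.

```python
# Minimal sketch of a subject-level cross-validation split (illustrative columns).
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical manifest: one row per FAF image, with the participant ID recorded.
manifest = pd.DataFrame({
    "image_path": ["img_001.tif", "img_002.tif", "img_003.tif", "img_004.tif"],
    "participant_id": ["P01", "P01", "P02", "P03"],
})

gkf = GroupKFold(n_splits=2)  # the study used 5 folds; 2 here to fit the toy data
for fold, (train_idx, val_idx) in enumerate(
    gkf.split(manifest, groups=manifest["participant_id"])
):
    # No participant appears in both the training and validation partitions.
    train_ids = set(manifest.loc[train_idx, "participant_id"])
    val_ids = set(manifest.loc[val_idx, "participant_id"])
    assert train_ids.isdisjoint(val_ids)
    print(f"Fold {fold}: {len(train_idx)} train images, {len(val_idx)} val images")
```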
Training Dataset: Image Labelling
The ground truth labels of GA were manually drawn by three independent graders. Images were split among the three graders, with a random subset undergoing repeat grading to assess intergrader agreement. Hypoautofluorescence (GA) was defined as a well-defined, homogeneously black area with a minimum size of 250 µm in its widest diameter or an area greater than 0.05 mm2.15 Areas of hypoautofluorescence within the entire macula-centered FAF image were demarcated using Photoshop (Adobe Inc., v 24.4.1) with a red outline and filled in with the paint bucket tool. Images were deemed ungradable and excluded from this study if the border of GA merged with peripapillary atrophy and could not be distinguished, if the GA extended outside the field of the image, or if poor image quality prevented clear delineation of GA borders. In Heidelberg FAF images, the macula was assumed to be involved if the hypoautofluorescent patch merged with the darkness of the macula and there was no clear region demarcating the two. Optical coherence tomography (OCT) images, which provide a more accurate assessment of foveal involvement, were not available. The intergrader agreement on detecting foveal involvement of GA was 84% (kappa 0.07). The 200-µm scale bar provided on Heidelberg FAF images was used to calibrate the images, and the pixels marked in red were converted to area measurements in square millimeters; areas were summed for eyes with multifocal GA to yield a single value. A minimal sketch of this conversion is given below.
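The conversion from annotated pixels to area follows directly from the calibrated pixel pitch. Below is a minimal sketch, assuming the pitch (µm per pixel) has already been derived from the scale bar; the specific numbers are illustrative.

```python
# Sketch of converting a binary GA mask to an area in mm^2, given a pixel pitch
# calibrated from the 200-µm scale bar (values illustrative).
import numpy as np

def mask_area_mm2(mask: np.ndarray, microns_per_pixel: float) -> float:
    """Sum lesion pixels and convert to mm^2 (multifocal lesions sum to one value)."""
    pixel_area_mm2 = (microns_per_pixel / 1000.0) ** 2  # µm -> mm, then square
    return float(mask.astype(bool).sum()) * pixel_area_mm2

# Example: if the 200-µm bar spans 17 pixels, the pitch is 200 / 17 ≈ 11.8 µm/pixel.
mask = np.zeros((768, 768), dtype=np.uint8)
mask[300:400, 300:400] = 1                      # toy 100 x 100-pixel lesion
print(round(mask_area_mm2(mask, 200 / 17), 2))  # ≈ 1.38 mm^2
```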
Dataset: External Validation (Testing)
External validation was performed using all screening-visit FAF images from a phase 2 study conducted by GlaxoSmithKline (GSK) between 2011 and 2016 (NCT01342926).16 This multicenter study was conducted across 40 centers in the United States and Canada and concluded that the experimental drug did not slow the enlargement rate of GA compared with placebo. All images were obtained by certified photographers and included 30-degree images centered on the fovea at high speed (768 × 768 pixels) or high resolution (1536 × 1536 pixels), with image averaging (ART mean function) set to 25. Inclusion criteria required well-demarcated GA with an area of 1.9 to 17 mm2 measured on color fundus photographs of the study eye. For multifocal GA, at least one focus had to be ≥1.9 mm2 and the total area of GA had to measure ≤17 mm2. However, all images submitted for the study were included in the dataset, irrespective of the inclusion-criteria range. Inclusion criteria also required a best-corrected visual acuity score of 35 letters or more in the study eye. FAF images were obtained as supplementary images using the same procedures as AREDS2 but were exported to the reading center in the Heidelberg proprietary e2e format. GA segmentation was performed in Heidelberg software using the same procedures described above. Images with annotations were exported in tiff format for AI validation. The external validation dataset consisted of 156 images (156 eyes, 100 participants).
Comparative Models for Segmentation
We investigated a range of AI models, each with a distinct combination of segmentation architecture and encoder module. The segmentation architectures included in this study were UNet, FPN (Feature Pyramid Network), and PSPNet (Pyramid Scene Parsing Network), whereas the pretrained encoders consisted of EfficientNet, ResNet (Residual Network), VGG (Visual Geometry Group), and Mix Vision Transformer (mViT). These combinations resulted in a diverse set of 12 segmentation models.
We chose these segmentation architectures for their favorable and distinct attributes in extracting detailed, fine-grained, and multiscale features for segmentation tasks.17 Figure 1 shows a schematic of these architectures. UNet is a U-shaped encoder-decoder model in which the input image is compressed into a feature vector (i.e., it shrinks an image down to its key features and then builds it back up to create a detailed segmentation map).18 This helps the model focus on important features by reducing complexity. FPN, with its pyramidal structure, is an advancement of UNet that makes predictions at multiple resolutions and combines them into an overall prediction.19 Last, the PSPNet model uses an adaptive pooling strategy: it concurrently extracts feature maps from different regions of the image and then pools them to produce semantic segmentations.20 This aggregation of local feature maps helps the model understand both subtle and large-scale context in the image.
Figure 1.
 
(A) A general view of how a segmentation in an AI model is created. An image is passed to an encoder (orange), which produces a feature map (blue). The feature map is then used as input to a decoder (green) to predict a segmentation mask (red). The encoder may send information directly to the decoder (gray arrows). (B) The general structure and data flow for the UNet architecture. The encoder produces a feature map of reduced height and width, which is then upscaled by the decoder to produce a final segmentation mask from the final layer of the decoder. (C) The general structure for an FPN. The encoder is unchanged compared with UNet, but the decoder now has extra internal connections that contribute to the final prediction. These additional connections help the decoder with scale or resolution. (D) The structure of a PSPNet architecture. The encoder again is mostly unchanged, but the decoder pools the feature map at various sizes and then combines the results to create a final segmentation mask.
Among the encoder models used in this study, EfficientNet, ResNet, and VGG are widely used convolutional neural network (CNN) models, whereas mViT is based on the recently popular transformer model. EfficientNet is an encoder designed to improve the efficiency of neural networks by balancing and optimizing model dimensions such as depth, width, and resolution.21 In segmentation tasks, this optimization is valuable for achieving better performance with fewer resources. ResNet is composed of residual blocks containing skip connections that allow information to flow directly across layers, facilitating training of deeper models with a large number of layers.22 With this residual learning strategy, ResNet can effectively capture hierarchical image features for segmentation. VGG consists of a stack of convolutional layers in which the kernel size can be varied to extract more intricate features for the segmentation task. Last, mViT is a vision transformer that leverages self-attention mechanisms to capture long-range dependencies and global context in the data.23 This ability of mViT to understand relationships within the entire image and to capture dependencies between spatially distant regions can aid complex segmentation tasks. Thus, each of the CNN- and transformer-based encoder-decoder modules possesses different attributes that can be beneficial for obtaining precise segmentations on retinal images. A minimal sketch of how such encoder-decoder combinations can be assembled is given below.
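As an illustration, below is a minimal sketch using the Segmentation Models PyTorch library (named in the training section that follows). The encoder identifiers shown are that library's names for the encoders described here; the authors' exact configuration may differ.

```python
# Sketch of instantiating the 12 decoder/encoder combinations with the
# Segmentation Models PyTorch library (the study's exact settings may differ).
import segmentation_models_pytorch as smp

DECODERS = {"UNet": smp.Unet, "FPN": smp.FPN, "PSPNet": smp.PSPNet}
ENCODERS = {
    "EfficientNet": "efficientnet-b5",  # compound-scaled CNN
    "ResNet": "resnet101",              # 101-layer residual CNN
    "VGG": "vgg19",                     # 19-layer plain CNN
    "mViT": "mit_b5",                   # SegFormer-B5 Mix Vision Transformer
}

def build_models():
    """Return a dict of the 12 models, each with an ImageNet-pretrained encoder."""
    models = {}
    for dec_name, dec_cls in DECODERS.items():
        for enc_name, enc_id in ENCODERS.items():
            models[f"{dec_name}_{enc_name}"] = dec_cls(
                encoder_name=enc_id,
                encoder_weights="imagenet",
                in_channels=3,   # FAF images converted to 3 channels
                classes=1,       # binary GA mask
            )
    return models
```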
Training of Segmentation Models
All segmentation models were trained and validated on the AREDS2 dataset and tested on the GSK dataset. A fivefold cross-validation (CV) scheme was implemented on the AREDS2 dataset, with each fold containing 120 input images split at the subject level. All images were first preprocessed. Images with a Heidelberg label bar had the bar cropped off, resulting in a perfectly square image; the bar was always 100 pixels tall and located at the bottom of the image. Most images were in three-channel RGB format (even though the content was black and white); those that were single channel were converted to three channels. Images were then resized to 512 × 512 pixels. The various encoder modules, pretrained on ImageNet, were incorporated with different design parameters: the EfficientNet encoder was scaled at the b5 level (28 M parameters), the ResNet encoder had 101 layers (42 M parameters), the VGG encoder had 19 layers (20 M parameters), and the mViT encoder was of type SegFormer-B5 (81 M parameters). All comparative models were trained in a uniform manner, with a batch size of 4 and a learning rate of 0.001, constrained by the available computational resources. We implemented early stopping to prevent overfitting by monitoring the validation loss with a patience of 10 epochs. All models were implemented on an NVIDIA Quadro RTX 5000 GPU using PyTorch Lightning and the Segmentation Models PyTorch library. The segmentation performance of all models was evaluated on the external GSK dataset. A sketch of this preprocessing and training configuration follows.
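Below is a minimal sketch of the preprocessing and training setup described above, using PyTorch Lightning's early stopping callback. The Lightning module and dataloaders are hypothetical placeholders, and the authors' actual training code may differ.

```python
# Sketch of preprocessing and training configuration (placeholders noted below).
import numpy as np
import cv2
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

def preprocess(image: np.ndarray, has_label_bar: bool) -> np.ndarray:
    """Crop the 100-pixel Heidelberg label bar, force 3 channels, resize to 512x512."""
    if has_label_bar:
        image = image[:-100, :]                      # bar sits at the bottom
    if image.ndim == 2:                              # single-channel FAF image
        image = np.stack([image] * 3, axis=-1)
    return cv2.resize(image, (512, 512), interpolation=cv2.INTER_AREA)

# Training settings reported in the text: batch size 4, learning rate 0.001,
# early stopping on validation loss with a patience of 10 epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=10, mode="min")
trainer = pl.Trainer(accelerator="auto", devices=1, max_epochs=100,
                     callbacks=[early_stop])
# trainer.fit(segmentation_module, train_loader, val_loader)  # placeholder objects
```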
Performance Evaluation of Segmentation Models
Statistical Assessment of AI Performance
The segmented GA mask was converted to mm2 using the known pixel size, and the area of the segmented GA mask was computed. GA characteristics of the segmented masks were summarized using statistics such as mask area. Performance of each of the 12 models was measured using the mean difference, correlation, and Dice coefficient between the area of the segmented GA mask and the GA label drawn by human graders. The Dice coefficient is the ratio of twice the intersection of the predicted and ground truth regions to the sum of the sizes of the predicted and ground truth regions on an image; a Dice score closer to 1 indicates excellent agreement in the spatial overlap of segmented pixels between AI and grader. The Dice coefficient was generated for a single internal validation set and for the fivefold CV framework, where the average score across all folds is reported. Additional segmentation metrics, namely the Jaccard index, precision, and recall, were also computed for all 12 models (see the sketch below).24 Owing to the variable manifestation of GA, the performance of all models was further evaluated specifically for segmentation of unifocal versus multifocal GA and the subfoveal GA subtype.
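Below is a minimal sketch of these per-image overlap metrics computed from binary masks; it is a generic formulation, not the authors' evaluation code.

```python
# Per-image overlap metrics (Dice, Jaccard, precision, recall) between a
# predicted mask and the grader's mask.
import numpy as np

def overlap_metrics(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> dict:
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()    # pixels marked GA by both
    fp = np.logical_and(pred, ~truth).sum()   # AI-only pixels
    fn = np.logical_and(~pred, truth).sum()   # grader-only pixels
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),   # 2|A∩B| / (|A| + |B|)
        "jaccard": tp / (tp + fp + fn + eps),        # |A∩B| / |A∪B|
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
    }
```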
Grader Assessment of AI Performance
Additionally, a subjective assessment was performed by three masked graders (CF, JB, and CP), who visually evaluated and rated the performance on each image in the external testing set. For each image, a panel of 13 segmentation masks (the human annotation plus the 12 architecture predictions) was labeled 1 through 13 without indicating which was the ground truth. The graders used a four-level scoring system to rate their agreement with each segmentation mask. A score of 1 was defined as excellent segmentation with no edits to the mask needed, and 2 was defined as good with only minor edits needed. A score of 3 was defined as fair but with some critical errors that needed to be edited (e.g., foveal involvement or peripapillary involvement) (Fig. 2B). Finally, a score of 4 was defined as poor, with the segmentation needing to be mostly redone. Images were evenly split among the three graders, with a subset graded by all three to ascertain reproducibility. Agreement among the graders was 67%, with most disagreements occurring between the excellent and good scores.
Figure 2.
 
Fundus autofluorescence image with segmentation of geographic atrophy by human grader (treated as the ground truth) and predictions from the 12 AI architectures. (A) A multifocal foveal GA that is segmented accurately by UNet and FPN architectures but not by PSPNet. (B) A multifocal, extrafoveal GA with peripapillary atrophy (arrow). The vision transformer models FPN_mViT and UNet_mViT agree the most with ground truth. Almost all other models have difficulty separating peripapillary atrophy from GA.
Results
The detailed demographic characteristics of the training and test datasets are presented in Table 1. A visual representation of the GA masks predicted by all 12 models on representative images from the training dataset is shown in Figures 2A and 2B. The mean GA area was 6.65 mm2 (SD = 6.30, range 0.10–36.30) for the AREDS2 training dataset and 9.79 mm2 (SD = 5.60, range 0.4–24.3) for the testing (external validation) dataset.
Table 1.
 
GA Characteristics From Autofluorescence Images in the Internal Cross-Validation and External Validation/Test Datasets
Model Performance on Dice Coefficients
In the AREDS2 dataset, the mean CV Dice coefficient across all 12 models ranged from 0.827 to 0.928, with the lowest score obtained by PSP_ResNet and the highest by the FPN_mViT and UNet_mViT models, as shown in Table 2 and Supplementary Figure S1. Models showed similar performance on the GSK external validation (EV) dataset, with Dice coefficients ranging from 0.877 to 0.939, as shown in Table 3 and Figure 3. All models showed moderately high precision (CV range 0.877–0.940, EV range 0.930–0.967) and recall (CV range 0.877–0.940, EV range 0.930–0.967), as shown in Supplementary Table S1. The performance of all 12 models on the test dataset is shown in Figure 4, which displays the distribution of Dice coefficients by GA area. All models showed higher Dice scores and better segmentation for larger GA lesions, whereas performance on smaller GA areas was more variable. Among the 12 models, PSPNet demonstrated the highest variation in Dice scores across differently sized GAs, whereas UNet_mViT and FPN_mViT showed more consistent performance across GA sizes.
Table 2.
 
Performance Metrics for the 12 AI Models Assessing GA Area in the Cross-Validation AREDS2 Dataset
Table 3.
 
Performance Metrics for the 12 AI Models Assessing GA Area in the GSK External Validation/Test Dataset
Figure 3.
 
Box plot showing the distribution of Dice coefficients across the 12 models in the test dataset. A Dice score closer to 1 indicates excellent agreement in spatial overlap of segmented pixels between AI and grader. All models have a Dice score >0.8. The mViT encoder with either the FPN or UNet framework has the lowest variability in Dice coefficients.
Figure 4.
 
Scatterplots displaying the distribution of Dice coefficients from all 12 models for GA segmentation of variable sizes in the test dataset. Dice coefficients are generally higher for larger areas of GA than for smaller areas, indicating that the models perform better on larger lesions.
Model Performance on Measurement of GA Area
The mean difference in GA area between the ground truth and masks generated by each of the 12 models in the cross-validation dataset ranged from −0.73 mm2 (95% CI: −5.75, 4.29) to 0.16 mm2 (95% CI: −3.35, 3.65), as shown in Table 2. The smallest mean GA area differences were displayed by the FPN_EfficientNet and FPN_mViT models. Among the encoders, ResNet demonstrated the largest area difference between ground truth and predicted GA masks across the 12 models on the training dataset. In comparison, intergrader assessment of GA area in 47 eyes showed a mean difference of 0.36 mm2 (−1.03, 1.75) with a Dice coefficient of 0.99.25
Performance was similar on the GSK test dataset, with the highest variability in area measurements seen with the PSPNet architectures, whereas the mViT-based FPN and UNet models displayed the smallest differences between predicted and ground truth GA areas, as shown in Table 3 and Figure 5. Most of the segmentation models compared in this study tended to overestimate the GA area. A sketch of the Bland-Altman summary underlying Figure 5 is given below.
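Below is a minimal sketch of the Bland-Altman quantities plotted in Figure 5: the mean difference between AI-predicted and grader-measured areas and the 95% limits of agreement. The example values are made up, and this is a generic illustration rather than the authors' statistical code.

```python
# Bland-Altman summary: mean difference and 95% limits of agreement between
# AI-predicted and grader-measured GA areas.
import numpy as np

def bland_altman(ai_area_mm2: np.ndarray, grader_area_mm2: np.ndarray):
    diff = ai_area_mm2 - grader_area_mm2
    mean_diff = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)   # 95% limits of agreement half-width
    return mean_diff, mean_diff - half_width, mean_diff + half_width

# Toy example with made-up areas (mm^2):
ai = np.array([5.2, 10.1, 2.4, 17.8])
grader = np.array([5.0, 9.5, 2.9, 17.0])
print(bland_altman(ai, grader))
```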
Figure 5.
 
Bland-Altman plots showing the difference between AI-generated segmentation and the ground truth segmentation by human graders of GA areas using the test dataset. All models with the PSPNet architecture have a wide distribution of data points, indicating weaker agreement between the AI and ground truth. Both FPN and UNet architectures with mViT have the tightest agreement across the range of GA areas.
Model Performance Based on GA Subtypes
We evaluated the performance of all 12 models on GA subtypes to identify any architecture-specific trends for subfoveal versus extrafoveal and unifocal versus multifocal GA. The performance of the models on the test dataset, shown in Supplementary Figure S2, predominantly indicates larger variability in Dice scores from the PSPNet_ResNet model for smaller unifocal GA masks and more consistent performance by the mViT-based UNet and FPN models. Similar trends were observed for subfoveal and extrafoveal GA, with the PSPNet models showing higher variability for smaller GA of both subtypes, as shown in Supplementary Figure S3. No differences in model performance were found between these GA subtypes.
Grader Assessment of Model Performance
Anonymized segmentation masks, comprising the 12 AI-generated masks and the ground truth, were presented to masked graders for scoring. The graders were unaware of which segmentation was drawn by the human grader and which architecture generated each mask. The ground truth received a score of excellent or good in >95% of images (Fig. 6). The vision transformer (mViT) models with the UNet and FPN frameworks received the highest scores among the AI-generated masks, with >75% in the excellent or good category. All other FPN- and UNet-based architectures scored around 70% for the excellent or good category, whereas all PSPNet architectures scored around 40%.
Figure 6.
 
Grader evaluation of AI segmentation model performance on the test dataset using a four-level score. The stacked bar chart displays the percentage distribution of grader scores for each of the 12 models. Graders were presented with 13 distinct masks for scoring for each raw fundus autofluorescence image. The graders were masked to which segmentation was the ground truth as well as the AI architectures.
Discussion
In this study, we performed a comparative analysis of various AI architectures to understand the implications of architecture variability for GA segmentation. Using 12 architectures consisting of four encoders and three decoders allowed us to systematically compare the combinations. Model performance was compared using the Dice coefficient, the difference in area between ground truth and AI mask, and a subjective assessment of the GA mask by expert graders. These metrics were evaluated for both the AREDS2 cross-validation dataset and the GSK test dataset. The results show that all 12 architectures have comparable metrics, with high performance overall and a Dice score >0.8 across all models. Within the 12 models, the vision transformer encoder with a UNet or FPN framework performed best, with a Dice coefficient of 0.93, the least variability in area differences, and >75% of GA predictions scored as good or excellent by graders. In contrast, the PSPNet architecture with all encoder combinations had the lowest performance, with a Dice coefficient <0.9, the highest variability in area differences, and <40% of predictions scored good or excellent by graders.
There is limited research on the effect of architecture choice in retinal imaging.26,27 Kugelman et al.26 evaluated various UNet architectures for segmenting retinal layers on OCT scans and found comparable performance across all architectures. Most studies evaluating deep learning for segmenting GA on two-dimensional autofluorescence images have used the UNet architecture.10,28 UNet and FPN architectures are similar in their internal structure, the main difference being that FPN makes predictions at multiple resolutions or scales, whereas UNet makes predictions only at the finest resolution. The PSPNet architecture, on the other hand, exploits the global context in the image by pooling or aggregating features from different regions of the feature map, using fully connected layers rather than skip connections between the encoder and decoder. This major architectural difference probably leads to inefficient decoding of spatial and global contextual information by PSPNet, thereby limiting its ability to generalize across different phenotypes of GA, as depicted in Figure 2.
Both the UNet and FPN architectures showed similarly superior segmentation performance compared with PSPNet, with FPN performing slightly better. This could be attributed to the top-down approach of the FPN architecture, which enables it to capture multiscale features that extract both fine details and global context from the FAF images for robust GA segmentation. In terms of encoders, overall performance was largely similar across the CNN-based models, with EfficientNet demonstrating higher Dice scores, less variability, and better segmentation than the others. The vision transformer (mViT) showed consistent performance across the range of GA areas, with tight confidence limits compared with the CNN models, as shown in Figure 5. This difference could be due to an architectural constraint of CNN-based encoders, which are primarily designed to capture local features effectively and may therefore struggle to model long-range dependencies and capture global context in the image, leading to poor segmentation of distant multifocal GA lesions. The relative segmentation performance of the encoders could follow a different trend depending on the image resolution and model hyperparameters. Considering the recent emergence of vision transformers (mViT) as an alternative to conventional CNNs, there is potential for yet-undiscovered optimizations that may further enhance their performance.29
The Dice coefficient is high (>0.87) for all models; however, the box plots in Figure 3 show outliers across all models, particularly with PSPNet. The Dice coefficient depends on the intersection of pixels between the two segmentations being compared and the total number of pixels in the ground truth and prediction. When the segmentation area is small, false-positive and false-negative results have a more substantial effect on the Dice coefficient than for larger areas: smaller areas have fewer pixels to match, so even a few mismatches lower the Dice score. This is clearly seen in Figure 4, where the Dice coefficient is lower for smaller lesions and improves with larger area; the toy calculation below illustrates the effect. Bland-Altman plots and Dice coefficients offer distinct viewpoints for comparison, one reporting numerical differences and the other spatial overlap. In a scenario where different lesions of similar area are measured by AI and ground truth, the Bland-Altman plots could indicate misleadingly excellent agreement, whereas the Dice coefficients would reveal poor performance.
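A toy calculation makes this size dependence concrete: the same absolute pixel error produces very different Dice scores for small and large lesions. The numbers below are illustrative and not taken from the study.

```python
# Toy illustration of why a fixed segmentation error hurts the Dice score more
# for small lesions than for large ones (illustrative numbers only).
def dice_with_boundary_error(lesion_pixels: int, error_pixels: int) -> float:
    # Prediction overshoots the truth by `error_pixels` false-positive pixels.
    tp, fp, fn = lesion_pixels, error_pixels, 0
    return 2 * tp / (2 * tp + fp + fn)

for lesion in (500, 5_000, 50_000):          # small, medium, large lesions (pixels)
    print(lesion, round(dice_with_boundary_error(lesion, error_pixels=500), 3))
# The same 500-pixel error drops Dice to 0.667 for the small lesion but barely
# affects the large one (≈0.995).
```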
Although statistical metrics provide insight into model performance, this project also used a unique subjective assessment. In clinical trials, GA assessment is usually performed by two graders, with a third senior grader reviewing the evaluation and selecting the more accurate one as the definitive version. A similar exercise was completed here, with 13 unlabeled segmentations (AI predictions and ground truth) presented to the graders for assessment of accuracy. Despite the masking, ground truth segmentations were given an excellent/good score in >95% of cases, whereas the best-performing vision transformers received similar scores in ∼75% of segmentations. This highlights the importance of a multifaceted assessment of AI models using real-world methods. In addition, other factors such as computational time, the energy required for training and inference (and thus the carbon footprint and real monetary cost), and ease of integration into clinical workflows are critical to evaluating a model's practical utility. These considerations affect not just the feasibility of deploying a model in a real-world setting but also its environmental and economic sustainability.30 Balancing these aspects with performance metrics is essential for a holistic assessment of AI models.
This project was conceived with the hypothesis that certain architectures would perform better with specific phenotypes, such as unifocal versus multifocal GA and extrafoveal versus subfoveal GA. Multifocal GA is more complicated to segment because multiple lesions need to be identified and segmented. Segmentation of extrafoveal GA is complicated because the normal decreased autofluorescence of the fovea in FAF images needs to be differentiated from that of GA. The lack of OCT is a limitation for accurate assessment of foveal involvement in this dataset. As an extension of this hypothesis, we proposed to selectively invoke specific AI models based on the phenotype presented. However, no significant variance was observed in model performance across phenotypes such as multifocal/unifocal and subfoveal/extrafoveal, leading us away from the invocation concept. The general performance trends in the overall dataset were also reflected across phenotypes, with the vision transformer encoder combined with the UNet and FPN architectures performing best throughout. We then considered an ensemble model incorporating the highest-performing architectures. The ensemble generated predictions by averaging the predictions from the top four models (FPN_EfficientNet, FPN_mViT, UNet_EfficientNet, and UNet_mViT), as sketched below. The performance improvement of the ensemble was marginal compared with the best-performing framework, with a Dice score of 0.931 on the validation split from the AREDS2 dataset and 0.925 on the GSK test dataset. The incremental gains did not convincingly outweigh the added complexity.
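Below is a minimal sketch of such an ensemble, averaging per-pixel probabilities from the four models and thresholding the result; the model variables are placeholders for the trained networks, and the averaging scheme shown is one straightforward way to implement the prediction averaging described above.

```python
# Sketch of ensembling by averaging per-pixel probabilities from several models.
import torch

@torch.no_grad()
def ensemble_predict(models: list, image: torch.Tensor, threshold: float = 0.5):
    """image: (1, 3, 512, 512) tensor; returns a binary GA mask of shape (1, 1, 512, 512)."""
    probs = [torch.sigmoid(m(image)) for m in models]   # per-model probability maps
    mean_prob = torch.stack(probs, dim=0).mean(dim=0)   # average across models
    return (mean_prob > threshold).float()

# Usage, assuming the four trained models are available (placeholder names):
# top4 = [fpn_efficientnet, fpn_mvit, unet_efficientnet, unet_mvit]
# mask = ensemble_predict(top4, preprocessed_image_tensor)
```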
This study's strengths are the comprehensive comparison of 12 architectures in a large, phenotypically diverse dataset such as AREDS2 and the external validation in a clinical trial dataset. The AREDS2 study included eyes with intermediate AMD in one or both eyes and as such did not have an area cutoff, including both prevalent and incident GA; the training dataset therefore had a large area range, from 0.1 to 36.3 mm2. One of the challenges with real-world implementation of AI models is degradation of model performance, primarily because of the selective nature of training data: models trained on selected data may not perform well in the real world. In contrast, the models in this project were trained on nearly real-world representative images with diverse presentations of GA from the AREDS2 dataset and tested on selective clinical trial data.25 Ground truth was established by experienced readers, and the performance metrics followed the grader workflow. To maintain comparability between the architectures, we controlled hyperparameters such as the optimizer, learning rate, and general model size (number of parameters). Standardizing hyperparameters might not have allowed each model to perform optimally, and the limited improvement from the ensemble approach suggests a complexity-performance trade-off.
This research provides a careful assessment of AI architectures for GA segmentation and concludes that vision transformers offer the best performance for this specific task and dataset. It underscores the need for ongoing assessment of emerging AI technologies to optimize performance on the metrics of interest, particularly for slow-growing lesions such as GA, where precise measurements are critical. It is important to explore and assess a variety of architectures to identify the most effective approach for the specific needs and challenges of medical imaging. Selecting and testing the appropriate AI architecture is an important part of model development and should be aligned with the project's distinct objectives.
Acknowledgments
This publication is based on research using data from GSK that has been made available through CSDR secured access. GSK has not contributed to or approved, and is not in any way responsible for, the contents of this publication. We thank both GSK and CSDR for providing us with data and access.
Supported in part by an unrestricted grant from Research to Prevent Blindness, Inc. to the UW Madison Department of Ophthalmology and Visual Sciences. 
Disclosure: A. Safai, None; C. Froines, None; R. Slater, None; R.E. Linderman, None; J. Bogost, None; C. Pacheco, None; R. Voland, None; J. Pak, None; P. Tiwari, Johnson & Johnson (C), LivAI Inc. (C, I, O), no conflict with presented work; R. Channa, None; A. Domalpally, None 
References
Litjens G, Kooi T, Bejnordi BE, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017; 42: 60–88. [CrossRef] [PubMed]
Klein R, Chou CF, Klein BE, Zhang X, Meuer SM, Saaddine JB. Prevalence of age-related macular degeneration in the US population. Arch Ophthalmol. 2011; 129: 75–80. [CrossRef] [PubMed]
Keenan TD, Agrón E, Domalpally A, et al. Progression of geographic atrophy in age-related macular degeneration: AREDS2 report number 16. Ophthalmology. 2018; 125: 1913–1928. [CrossRef] [PubMed]
Holz FG, Sadda SR, Staurenghi G, et al. Imaging protocols in clinical studies in advanced age-related macular degeneration: recommendations from classification of atrophy consensus meetings. Ophthalmology. 2017; 124: 464–478. [CrossRef] [PubMed]
Csaky K, Ferris F, 3rd, Chew EY, Nair P, Cheetham JK, Duncan JL. Report from the NEI/FDA endpoints workshop on age-related macular degeneration and inherited retinal diseases. Invest Ophthalmol Vis Sci. 2017; 58: 3456–3463.
Schmitz-Valckenberg S, Brinkmann CK, Alten F, et al. Semiautomated image processing method for identification and quantification of geographic atrophy in age-related macular degeneration. Invest Ophthalmol Vis Sci. 2011; 52: 7640–7646. [CrossRef] [PubMed]
Keenan TD, Dharssi S, Peng Y, et al. A deep learning approach for automated detection of geographic atrophy from color fundus photographs. Ophthalmology. 2019; 126: 1533–1540. [CrossRef] [PubMed]
Arslan J, Samarasinghe G, Benke KK, et al. Artificial intelligence algorithms for analysis of geographic atrophy: a review and evaluation. Transl Vis Sci Technol. 2020; 9(2): 57. [CrossRef] [PubMed]
Anegondi N, Gao SS, Steffen V, et al. Deep learning to predict geographic atrophy area and growth rate from multimodal imaging. Ophthalmol Retina. 2023; 7: 243–252. [CrossRef] [PubMed]
Spaide T, Jiang J, Patil J, et al. Geographic atrophy segmentation using multimodal deep learning. Transl Vis Sci Technol. 2023; 12(7): 10. [CrossRef] [PubMed]
Yang Q, Anegondi N, Steffen V, Rabe C, Ferrara D, Gao SS. Multi-modal geographic atrophy lesion growth rate prediction using deep learning. Invest Ophthalmol Vis Sci. 2021; 62: 235–235.
Li AS, Myers J, Stinnett SS, Grewal DS, Jaffe GJ. Gradeability and reproducibility of geographic atrophy measurement in GATHER-1, a phase II/III randomized interventional trial. Ophthalmol Sci. 2024; 4(2): 100383. [CrossRef] [PubMed]
Chew EY, Clemons T, SanGiovanni JP, et al. The Age-related Eye Disease Study 2 (AREDS2) Study Design and Baseline Characteristics (AREDS2 Report Number 1). Ophthalmology. 2012; 119: 2282–2289. [CrossRef] [PubMed]
Domalpally A, Danis R, Agron E, Blodi B, Clemons T, Chew E. Evaluation of geographic atrophy from color photographs and fundus autofluorescence images: age-related eye disease study 2 report number 11. Ophthalmology. 2016; 123: 2401–2407. [CrossRef] [PubMed]
Sadda SR, Guymer R, Holz FG, et al. Consensus definition for atrophy associated with age-related macular degeneration on OCT: classification of atrophy report 3. Ophthalmology. 2018; 125: 537–548. [CrossRef] [PubMed]
Rosenfeld PJ, Berger B, Reichel E, et al. A randomized phase 2 study of an anti-amyloid β monoclonal antibody in geographic atrophy secondary to age-related macular degeneration. Ophthalmol Retina. 2018; 2: 1028–1040. [CrossRef] [PubMed]
Sang S, Zhou Y, Islam MT, Xing L. Small-object sensitive segmentation using across feature map attention. IEEE Trans Pattern Anal Mach Intell. 2023; 45: 6289–6306. [CrossRef] [PubMed]
Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Berlin: Springer International Publishing; 2015: 234–241.
Lin T-Y, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2117–2125.
Zhao H, Shi J, Qi X, Wang X, Jia J. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2881–2890.
Mathews MR, Anzar SM, Krishnan RK, Panthakkan A. EfficientNet for retinal blood vessel segmentation. In: 2020 3rd International Conference on Signal Processing and Information Security (ICSPIS). 2020: 1–4.
Xu W, Fu Y-L, Zhu D. ResNet and its application to medical image processing: research progress and challenges. Comput Methods Programs Biomed. 2023; 240: 107660. [CrossRef] [PubMed]
Xiao H, Li L, Liu Q, Zhu X, Zhang Q. Transformers in medical image segmentation: a review. Biomed Signal Proc Control. 2023; 84: 104791. [CrossRef]
Müller D, Soto-Rey I, Kramer F. Towards a guideline for evaluation metrics in medical image segmentation. BMC Res Notes. 2022; 15: 210. [CrossRef] [PubMed]
Domalpally A, Slater R, Linderman R, et al. Strong vs weak data labeling for artificial intelligence algorithms in the measurement of geographic atrophy. Ophthalmol Sci. 2024; 4(5): 100477. [CrossRef] [PubMed]
Kugelman J, Allman J, Read SA, et al. A comparison of deep learning U-Net architectures for posterior segment OCT retinal layer segmentation. Sci Rep. 2022; 12(1): 14888. [CrossRef] [PubMed]
Domínguez C, Heras J, Mata E, Pascual V, Royo D, Zapata MÁ. Binary and multi-class automated detection of age-related macular degeneration using convolutional- and transformer-based architectures. Comput Methods Programs Biomed. 2023; 229: 107302. [CrossRef] [PubMed]
Arslan J, Samarasinghe G, Sowmya A, et al. Deep learning applied to automated segmentation of geographic atrophy in fundus autofluorescence images. Transl Vis Sci Technol. 2021; 10(8): 2. [CrossRef]
Khan RF, Lee BD, Lee MS. Transformers in medical image segmentation: a narrative review. Quant Imaging Med Surg. 2023; 13: 8747–8767. [CrossRef] [PubMed]
Gupta U, Kim YG, Lee S, et al. Chasing carbon: the elusive environmental footprint of computing. In: 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 2021: 854–867.