Investigative Ophthalmology & Visual Science
July 2024
Volume 65, Issue 9
Open Access
ARVO Imaging in the Eye Conference Abstract  |   July 2024
ViT features for diabetic retinopathy grading and lesion segmentation
Author Affiliations & Notes
  • Olivia Kay
    Ashbury College, Ottawa, Ontario, Canada
  • Keith Miller
    University of Michigan Medical School, Ann Arbor, Michigan, United States
  • Mickey Nguyen
    University of Michigan Medical School, Ann Arbor, Michigan, United States
  • Footnotes
    Commercial Relationships: Olivia Kay, None; Keith Miller, None; Mickey Nguyen, None
    Support: None
Investigative Ophthalmology & Visual Science July 2024, Vol. 65, PB0061.

Olivia Kay, Keith Miller, Mickey Nguyen; ViT features for diabetic retinopathy grading and lesion segmentation. Invest. Ophthalmol. Vis. Sci. 2024;65(9):PB0061.

© ARVO (1962-2015); The Authors (2016-present)

Abstract

Purpose : Transformer architectures have shown great success in computer vision. We evaluate the utility of transformer-encoder features for training diabetic retinopathy classification and segmentation networks.

Methods : We extracted MAE-pretrained ViT features from resized and/or cropped (retaining only retinal pixels) images from the MESSIDOR and FGADR datasets (0.8/0.2 training/testing split). For MESSIDOR, we also considered training on images from two centers and testing on the third. PCA followed by k-means clustering was performed on the encodings.
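
A minimal sketch of this feature-extraction and clustering pipeline follows; the specific checkpoint tag (timm's `vit_base_patch16_224.mae`), CLS-token pooling, and the PCA dimensionality are assumptions, as the abstract does not name them:

```python
# Sketch only: model tag, pooling choice, and PCA dimensionality are assumptions.
import timm
import torch
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

model = timm.create_model("vit_base_patch16_224.mae", pretrained=True, num_classes=0)
model.eval()

@torch.no_grad()
def embed(batch):
    """batch: [B, 3, 224, 224] resized or retina-cropped fundus images."""
    tokens = model.forward_features(batch)   # [B, 197, 768]: CLS + 14x14 patch tokens
    return tokens[:, 0]                      # one CLS embedding per image

# Given embeddings stacked into an [N, 768] array:
# reduced  = PCA(n_components=50).fit_transform(embeddings)
# clusters = KMeans(n_clusters=3, n_init=10).fit_predict(reduced)  # 3 ~ 3 centers
```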

Convolutional architectures with 4,417 and 119,043 parameters were trained for classification. A 12.5-million-parameter convolutional network with skip connections was trained for segmentation.
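
The abstract does not give the layer structure of these classifiers; the sketch below is one plausible compact convolutional head over the 14×14 grid of patch tokens, with illustrative channel widths, and with dropout placed before the first and final layers as described in the Results:

```python
import torch.nn as nn

# Hypothetical compact classifier over ViT patch embeddings; the exact layers of
# the 4,417- and 119,043-parameter networks are not given, so widths here are
# illustrative only.
class EmbeddingClassifier(nn.Module):
    def __init__(self, n_classes=4, width=16, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(p_drop),                    # dropout before the first layer
            nn.Conv2d(768, width, kernel_size=1),  # mix channels within each patch
            nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), # mix neighboring patches
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(p_drop),                    # dropout before the final layer
            nn.Linear(width, n_classes),
        )

    def forward(self, x):
        # x: [B, 768, 14, 14] -- the 196 patch tokens reshaped to their grid
        return self.net(x)
```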

A comparison ResNet50 network (25.5 million parameters) was trained on the original images (standard normalization, no augmentation, for a fair comparison).
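
A baseline along these lines would match that description; the ImageNet normalization constants, input size, and number of grades are assumptions:

```python
import torch.nn as nn
from torchvision import models, transforms

# Standard normalization, no augmentation (per the abstract); ImageNet statistics
# and the 4-grade output are assumptions.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet50(weights=None)           # ~25.5M parameters
resnet.fc = nn.Linear(resnet.fc.in_features, 4)  # 4 DR grades assumed
# Variant reported in Results: 80% dropout before the final linear layer
# resnet.fc = nn.Sequential(nn.Dropout(0.8), nn.Linear(2048, 4))
```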

Results : Clustering the MESSIDOR embeddings perfectly recovered the three hospitals that contributed data. Embeddings of cropped squares containing only retinal pixels from MESSIDOR likewise showed three prominent clusters: two included data from all centers and disease stages, while one included data from only one center but all disease stages.

Classification:

The baseline ResNet achieved 61% (MESSIDOR) and 63% (FGADR) testing accuracy, while training accuracy reached 99.9%. Adding 80% dropout before the final linear layer did not affect overtraining but minimally increased FGADR testing accuracy to 65%.

Training from embeddings, the classification networks achieved 62% (4k parameters) and 66% (119k parameters) accuracy on FGADR. Both were 62% accurate on MESSIDOR; accuracy on data from a center unseen during training dropped to 46% (baseline 55%).

Training converged in under 30 epochs. Applying 30-80% dropout before the first and final layers made the 119k architecture robust to overtraining.

Segmentation:

Training converged in 5 epochs to 10-29% precision and 1% recall. Poor performance occurred specifically on images whose lesions are smaller than the 16×16-pixel patches used for the encoding; such lesions make up most of the data.
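
One simple way to quantify this failure mode (a hypothetical helper, not from the abstract) is to measure what fraction of ground-truth lesion components have bounding boxes smaller than a single 16×16 encoding patch:

```python
import numpy as np
from scipy import ndimage

def small_lesion_fraction(mask, patch=16):
    """Fraction of connected lesion components whose bounding box is smaller
    than one patch x patch encoding tile. mask: 2D binary lesion mask."""
    labeled, n = ndimage.label(mask > 0)        # connected lesion components
    if n == 0:
        return 0.0
    small = 0
    for s in ndimage.find_objects(labeled):     # one bounding-box slice per lesion
        h = s[0].stop - s[0].start
        w = s[1].stop - s[1].start
        small += (h < patch and w < patch)
    return small / n
```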

Conclusions : Working from ViT embeddings allows for significantly smaller networks that are trainable on CPU and robust to overtraining. However, segmenting small lesions requires finer spatial precision: the network successfully identified moderate-size lesions, but smaller attention blocks may achieve significant improvements.

This abstract was presented at the 2024 ARVO Imaging in the Eye Conference, held in Seattle, WA, May 4, 2024.
