Abstract
Purpose:
Transformer architectures have shown great success in computer vision. We evaluate the utility of transformer encoder features in training diabetic retinopathy classification and segmentation networks.
Methods:
We extracted MAE pre-trained ViT features from resized and/or cropped (retinal pixels only) images from the MESSIDOR and FGADR datasets (0.8/0.2 training-testing split). For MESSIDOR, we also considered training on images from two centers and testing on the third. PCA followed by k-means clustering was performed on the encodings.
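The abstract gives no implementation details; the following is a minimal sketch of the feature-extraction and clustering step, assuming a timm MAE-pretrained ViT-B/16 encoder (`vit_base_patch16_224.mae`), 224x224 inputs, ImageNet normalization, 50 PCA components, three k-means clusters, and scikit-learn. None of these choices are stated in the abstract, and the image paths are hypothetical.

```python
# Minimal sketch (assumptions noted above): MAE-pretrained ViT features,
# then PCA and k-means clustering of the resulting encodings.
import glob

import numpy as np
import timm
import torch
from PIL import Image
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from torchvision import transforms

# Hypothetical image location; MESSIDOR/FGADR paths are not given in the abstract.
image_paths = sorted(glob.glob("data/messidor/*.tif"))

# MAE-pretrained ViT-B/16 encoder; num_classes=0 returns pooled embeddings.
encoder = timm.create_model("vit_base_patch16_224.mae", pretrained=True, num_classes=0)
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # images resized (and/or cropped to retinal pixels upstream)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths, batch_size=32):
    feats = []
    for i in range(0, len(paths), batch_size):
        batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                             for p in paths[i:i + batch_size]])
        feats.append(encoder(batch).numpy())   # (B, 768) pooled ViT features
    return np.concatenate(feats)

embeddings = embed(image_paths)
reduced = PCA(n_components=50).fit_transform(embeddings)         # assumed component count
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(reduced)  # 3 clusters, as in Results
```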
Convolutional architectures with 4,417 and 119,043 parameters were trained for classification. A 12.5-million-parameter convolutional network with skip connections was trained for segmentation.
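As an illustration of training on the embeddings, below is a hypothetical small convolutional classifier over the ViT patch-token grid. The abstract reports only the parameter counts (4,417 and 119,043) and the dropout placement; the layer widths, grid handling, and class count here are assumptions and do not reproduce those counts.

```python
# Hypothetical convolutional classifier over ViT patch-token embeddings.
# Layer widths and class count are illustrative, not taken from the abstract.
import torch
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    def __init__(self, in_dim=768, n_classes=4, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),                   # dropout before the first layer
            nn.Conv2d(in_dim, 16, kernel_size=1),  # 1x1 conv shrinks the embedding channels
            nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(dropout),                   # dropout before the final linear layer
            nn.Linear(16, n_classes),
        )

    def forward(self, tokens):
        # tokens: (N, P, in_dim) patch embeddings; P assumed to form a square grid (e.g. 14x14)
        n, p, d = tokens.shape
        side = int(p ** 0.5)
        x = tokens.transpose(1, 2).reshape(n, d, side, side)
        return self.net(x)
```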
A comparison ResNet50 network (25.5 million parameters) was trained on the original images (standard normalization; no augmentation, for a fair comparison).
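A sketch of this baseline, assuming the torchvision ResNet50 with a 4-grade output head; whether pretrained weights were used, the input size, optimizer, and learning rate are not stated in the abstract and are assumptions here.

```python
# Baseline comparison sketch: ResNet50 (~25.5M parameters) trained directly on
# the images with standard normalization and no augmentation.
import torch
import torch.nn as nn
from torchvision import models, transforms

baseline = models.resnet50(weights=None)              # training from scratch is an assumption
baseline.fc = nn.Linear(baseline.fc.in_features, 4)   # hypothetical 4-grade output head
# Variant mentioned in Results: 80% dropout before the final linear layer.
# baseline.fc = nn.Sequential(nn.Dropout(0.8), nn.Linear(2048, 4))

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                    # input size is an assumption
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
optimizer = torch.optim.Adam(baseline.parameters(), lr=1e-4)  # assumed hyperparameters
```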
Results:
Clustering the MESSIDOR embeddings perfectly recovered the three hospitals that contributed data. Embeddings of cropped squares containing only retinal pixels from MESSIDOR likewise showed three prominent clusters: two included data from all centers and disease stages, while one included data from only one center but from all disease stages.
Classification:
The baseline ResNet achieved 61% (MESSIDOR) and 63% (FGADR) testing accuracy, while training accuracy reached 99.9%. Adding 80% dropout before the final linear layer did not affect overtraining but marginally increased FGADR testing accuracy to 65%.
Trained from the embeddings, the classification networks achieved 62% (4k-parameter network) and 66% (119k-parameter network) accuracy on FGADR. Both were 62% accurate on MESSIDOR; accuracy on data from a center unseen during training dropped to 46% (baseline 55%).
Training converged in under 30 epochs. Dropout of 30-80% before the first and final layers made the 119k-parameter architecture robust to overtraining.
Segmentation:
Training converged in 5 epochs to 10-29% precision and 1% recall. Poor performance was specific to images whose lesions are smaller than the 16x16-pixel patches used for the encoding; such lesions make up most of the data.
Conclusions:
Working from ViT embeddings allows significantly smaller networks that are trainable on a CPU and robust to overtraining. However, segmentation of small lesions requires finer precision: the network successfully identified moderate-size lesions, but smaller attention blocks may yield significant improvements.
This abstract was presented at the 2024 ARVO Imaging in the Eye Conference, held in Seattle, WA, May 4, 2024.