Abstract
Purpose:
Self-supervised pre-training has been shown to yield deep learning (DL) models with remarkable data efficiency and generalization capabilities. Retinal imaging has untapped potential to exploit such approaches by leveraging matched multimodal data in the form of 2D fundus photography or near-infrared reflectance (NIR) images and 3D spectral-domain optical coherence tomography (SD-OCT) scans. We explore multimodal pre-training to enhance DL models on three downstream predictive tasks: disease classification, structure-function prediction, and treatment forecasting.
Methods:
We propose a multimodal contrastive pre-training method for retinal imaging (Fig. 1a), for which we utilized extensive longitudinal reading center data comprising 153,306 pairs of OCT volumes and corresponding fundus photographs (Topcon) or NIR images (Spectralis and Cirrus) from 3,790 patients with neovascular age-related macular degeneration. The pre-training objective brings similar instances (fundus and OCT images from the same eye) close together in the latent space and pushes dissimilar instances apart, fostering meaningful embeddings. Linear predictive models were then built on the pre-trained encoder blocks (Fig. 1b) and trained on external HARBOR clinical trial data for visual acuity, fluid presence, and high-treatment-need prediction, as well as on a mixed-disease dataset of clinical trial baseline scans for retinal disease screening.
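The exact loss is not specified in the abstract; the following minimal PyTorch sketch illustrates a symmetric CLIP-style InfoNCE objective consistent with the stated goal of pulling matched fundus-OCT pairs together and pushing mismatched pairs apart. The function name, embedding dimension, and temperature value are illustrative assumptions, not details of the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(fundus_emb: torch.Tensor,
                     oct_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched fundus/OCT embeddings.

    Row i of each tensor is assumed to come from the same eye, so the
    diagonal of the similarity matrix holds the positive pairs.
    """
    # L2-normalize so the dot product equals cosine similarity.
    fundus_emb = F.normalize(fundus_emb, dim=-1)
    oct_emb = F.normalize(oct_emb, dim=-1)

    # Pairwise similarities between every fundus image and every OCT volume.
    logits = fundus_emb @ oct_emb.t() / temperature

    # The matching pair sits on the diagonal: target class i for row i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Pull matched pairs together and push mismatched pairs apart, in both
    # directions (fundus -> OCT and OCT -> fundus).
    loss_f2o = F.cross_entropy(logits, targets)
    loss_o2f = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_f2o + loss_o2f)

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    f = torch.randn(8, 256)   # hypothetical 2D fundus/NIR encoder outputs
    o = torch.randn(8, 256)   # hypothetical 3D OCT encoder outputs
    print(contrastive_loss(f, o).item())
```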
Results:
Models built on the multimodal contrastively pre-trained encoders outperformed their fully supervised counterparts across all downstream tasks (Tab. 1), confirming that the pre-training captures relevant biomarkers. Notably, the pre-training also enhanced fundus-based prediction performance. Exploring the adaptability of our approach, we observed only limited performance decay (≤20%) when swapping imaging modalities at inference after training the predictive models on OCT embeddings, underscoring the robustness of the method and the possibility of leveraging the close mapping of image-volume pairs in the latent space.
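As an illustration of this modality swap, the hypothetical PyTorch sketch below applies a linear probe fitted on OCT embeddings to fundus embeddings at inference. The encoders, probe, input shapes, and embedding dimension are placeholders standing in for the trained models, not the authors' architecture.

```python
import torch
import torch.nn as nn

# Placeholder frozen encoders mapping each modality into a shared
# 256-dim latent space, as learned during contrastive pre-training.
# In practice these would be the pre-trained encoder weights.
oct_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256)).eval()
fundus_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256)).eval()

# Linear probe trained on OCT embeddings only (weights assumed fitted),
# e.g. producing a fluid-presence logit.
probe = nn.Linear(256, 1)

with torch.no_grad():
    oct_scan = torch.randn(1, 1, 49, 128, 128)   # dummy 3D OCT volume
    fundus = torch.randn(1, 3, 224, 224)         # dummy 2D fundus photo

    # Standard path: OCT embedding -> OCT-trained probe.
    z_oct = oct_encoder(oct_scan)
    # Modality swap: reuse the same probe on the fundus embedding,
    # relying on the two modalities being aligned in the latent space.
    z_fundus = fundus_encoder(fundus)

    print(torch.sigmoid(probe(z_oct)), torch.sigmoid(probe(z_fundus)))
```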
Conclusions:
In summary, this study underscores the capacity of multimodal contrastive pre-training to harness extensive unlabeled data, providing a promising starting point for image-interpretation tasks in retinal research and clinical care. Furthermore, by enhancing 2D fundus representations, our simple yet effective method may serve in (pre-)clinical settings where access to OCT is limited.
This abstract was presented at the 2024 ARVO Annual Meeting, held in Seattle, WA, May 5-9, 2024.