Abstract
Purpose :
A lack of high-quality, labelled medical datasets is preventing the development and implementation of machine learning (ML) models in the healthcare setting. We explore the use of generative ML models to produce synthetic Optical Coherence Tomography (OCT) B scans using two novel methods: Generative Adversarial Networks (GANs) and Diffusion Models. These models can be used to generate 'pre-labelled' OCT B scans at scale, which could then be shared freely for further research, for example to augment existing image classification models. Further, we evaluate these synthetic images to assess their appropriateness for use in the clinical environment.
Methods :
We train two state-of-the-art networks, StyleGAN3 and Stable Diffusion, to produce native-resolution OCT B scans across multiple pathological domains: healthy, drusen, and hydroxychloroquine (HCQ) retinopathy. We then train two distinct classification networks to classify a given B scan as either Normal or Drusen: one on our synthetic data and one on real data. We evaluate both models on a test set of real images and compare their performance. Further, we show a mixture of real and synthetic images to two clinical specialists to evaluate whether synthetic images are indistinguishable from real ones.
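The classifier comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the function names `accuracy` and `compare_on_real_test_set` are hypothetical, and each classifier is represented simply as a callable that maps an image to the label "Normal" or "Drusen".

```python
from typing import Callable, Dict, Sequence

# Hypothetical representation: a classifier is any callable that maps
# an image to one of the two labels, "Normal" or "Drusen".
Classifier = Callable[[object], str]


def accuracy(predict: Classifier,
             images: Sequence[object],
             labels: Sequence[str]) -> float:
    """Fraction of images whose predicted label matches the true label."""
    if len(images) != len(labels) or len(images) == 0:
        raise ValueError("images and labels must be non-empty and equal in length")
    correct = sum(predict(img) == lab for img, lab in zip(images, labels))
    return correct / len(labels)


def compare_on_real_test_set(synthetic_trained: Classifier,
                             real_trained: Classifier,
                             real_images: Sequence[object],
                             real_labels: Sequence[str]) -> Dict[str, float]:
    """Score both classifiers on the same held-out set of real B scans."""
    return {
        "synthetic_trained": accuracy(synthetic_trained, real_images, real_labels),
        "real_trained": accuracy(real_trained, real_images, real_labels),
    }
```

Scoring both models on the same real test set means any gap in accuracy reflects the training data rather than a difference in evaluation images.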
Results :
Our models produce scans at a resolution of 512x512 (normal and drusen) and 1024x1024 (HCQ), amongst the highest resolution B scans in the literature. In an initial qualitative analysis, several specialists are unable to consistently identify these scans as synthetic. However, with training, experts become increasingly able to distinguish real from synthetic images with high probability, and we identify several failure modes of synthetic images. The classification model trained on synthetic images is significantly less accurate than the model trained on real images (87% vs 98.6%).
Conclusions :
Synthetic imaging data has several useful clinical applications; however, our work suggests that ML models and humans can trivially distinguish such images from real ones. By enumerating the failure modes of these images, we provide a basis for improving generative models to produce more realistic images in future.
This abstract was presented at the 2023 ARVO Annual Meeting, held in New Orleans, LA, April 23-27, 2023.