Abstract
Purpose:
As deep learning applications to ophthalmic imaging grow in sophistication and clinical relevance, little is known about the extent to which algorithm performance generalizes predictably. This study set out to determine how well the accuracy of a fundus photograph classifier built on one dataset is replicated on a second dataset captured with a different device.
Methods:
25,000 high-quality fundus photographs were manually selected from the UK Biobank (UKBB) (Topcon 3D OCT-1000, field angle 45°). A simple deep transfer learning model based on the VGG architecture was built to classify images as right vs. left eyes. This algorithm was then validated, without modification, on two versions of a smaller sample (n=430) of fundus photographs (Optos® California, field angle 200°) from the Massachusetts Eye and Ear Infirmary (MEEI): the first was cropped to the posterior pole (MEEI-a) to approximate the region captured in the UKBB sample, and the second (the same images) was cropped to the circular fundus edge (MEEI-b). The same process was then repeated in reverse: a model constructed on MEEI images was deployed on UKBB images.
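The abstract does not describe the implementation beyond "a simple deep transfer learning model based on VGG architecture." As an illustration only, a minimal sketch of that kind of setup might look like the following, assuming a Keras VGG16 backbone with a frozen ImageNet-pretrained convolutional base; the library choice, layer sizes, input resolution, and training settings are all assumptions, not the authors' method.

```python
# Hypothetical sketch of a VGG-based transfer-learning laterality classifier.
# All choices below (Keras, VGG16, head size, optimizer) are assumptions;
# the abstract only states that a simple VGG-based transfer model was used.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Start from VGG16 weights pre-trained on ImageNet; freeze the convolutional
# base so that only the new classification head is trained.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Binary head: 0 = left eye, 1 = right eye.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.AUC(name="auroc")],
)

# train_ds / val_ds would be data pipelines over the curated UKBB fundus
# photographs with laterality labels (names are hypothetical):
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```

Freezing the pretrained base and fitting only a small binary head is one common way to realize a "simple" transfer-learning classifier on a task like laterality, where the label is determined by coarse anatomical landmarks such as optic disc position.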
Results:
The UKBB laterality classification model (LCM) achieved an AUROC of 0.997. When evaluated on the MEEI-a and MEEI-b datasets, the resulting AUROCs were 0.944 and 0.778, respectively. The LCM subsequently built on MEEI-a achieved an AUROC of 0.991. When evaluated on the MEEI-b and UKBB datasets, performance dropped to AUROCs of 0.545 and 0.713, respectively.
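For concreteness, the cross-dataset comparison reported above amounts to scoring a trained model, unchanged, on an external test set and computing AUROC from its predicted probabilities. A minimal sketch follows; the scikit-learn call is standard, but the function and variable names are illustrative, not the authors' code.

```python
# Hypothetical sketch of the external-validation step: a model trained on one
# dataset is evaluated, without retraining, on images from another device.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_auroc(model, images: np.ndarray, labels: np.ndarray) -> float:
    """Return AUROC of `model` on an external test set.

    `images` is a batch of preprocessed fundus photographs and `labels` the
    ground-truth laterality (0 = left, 1 = right); names are illustrative.
    """
    probs = model.predict(images).ravel()  # predicted P(right eye)
    return roc_auc_score(labels, probs)

# e.g. evaluating the UKBB-trained model on both MEEI crops (hypothetical names):
# auroc_meei_a = evaluate_auroc(ukbb_model, meei_a_images, meei_labels)
# auroc_meei_b = evaluate_auroc(ukbb_model, meei_b_images, meei_labels)
```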
Conclusions:
Simple and accurate algorithms generalize variably across devices and image settings. This finding highlights the importance of validation studies prior to deployment for clinical use.
This is a 2020 ARVO Annual Meeting abstract.