Abstract
Purpose :
Previous research has demonstrated that the clinical effectiveness of computer-aided diagnostic (CAD) systems may differ significantly from the results obtained in carefully controlled research trials. Our goal was to conduct a case study of a deep learning (DL) based CAD system deployed in a clinical setting, to understand the real-world factors that may affect the performance, use, and evaluation of these tools in practice.
Methods :
We conducted a prospective single-reader study of n=1000 patients using a DL system designed to assist in the diagnosis of diabetic retinopathy (DR). For each patient, three non-mydriatic fundus images of each eye were taken and sent electronically to an experienced reader (clinician) for diagnosis. After grading each case on a five-point DR scale, the reader viewed the diagnosis provided by the DL system (also on a five-point scale) and was asked to provide a final diagnosis. The reader was allowed to consult a retina specialist at any point.
Results :
After seeing the DL results, the reader changed their grade in about 3% of cases, resulting in a 20% increase in the number of proliferative diabetic retinopathy cases detected. The reader consulted a specialist on 4% of cases before seeing the DL output and on an additional 8% after seeing it. The reader’s final diagnosis differed from the DL diagnosis in 20% of the cases. A separate group of retina specialists adjudicated these discordant cases. The false-positive rate for the reader was almost unchanged at <5% before and after seeing the DL output, despite a high false-positive rate of 25% for the DL system. Note that the DL output was conservative: it adopted the most severe grading for any one of the three images per eye, regardless of whether the system judged that image “gradable”. DL false positives decreased from 20% to 10% when the “ungradable” images (38% of the total) were removed from the analysis.
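The per-eye aggregation rule described above can be illustrated with a minimal sketch. It is not the deployed system's code: the integer grade encoding (0 to 4 on the five-point scale), the `(grade, gradable)` tuple layout, and the function name `eye_grade` are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical per-image result: (grade, gradable), where grade is an
# assumed 0-4 encoding of the five-point DR scale.
ImageResult = Tuple[int, bool]


def eye_grade(images: List[ImageResult], drop_ungradable: bool = False) -> Optional[int]:
    """Return the most severe grade across one eye's images.

    By default every image counts, even those flagged ungradable
    (the conservative rule). With drop_ungradable=True, ungradable
    images are ignored; None is returned if no image remains.
    """
    grades = [grade for grade, gradable in images if gradable or not drop_ungradable]
    return max(grades) if grades else None


# Illustrative eye: the highest grade comes from an image flagged ungradable.
example_eye = [(0, True), (3, False), (1, True)]
conservative = eye_grade(example_eye)
filtered = eye_grade(example_eye, drop_ungradable=True)
```

In this made-up example the conservative rule reports the grade from the ungradable image, while the filtered variant reports the highest grade among gradable images only. This is the kind of difference that could plausibly contribute to the change in DL false positives once ungradable images are excluded.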
Conclusions :
Pragmatic factors, such as whether and when a system rejects an image as ungradable or whether an immediate expert consult is available, can have major impacts on the clinical usefulness and efficacy of DL CAD systems in practice. Pilot deployments and initial field tests of these systems should monitor the effect of these factors over time in order to create the safest systems with the most value to patients.
This is a 2020 ARVO Annual Meeting abstract.