Abstract
Purpose:
In the last decade, there have been vast advancements in artificial intelligence (AI) in ophthalmology. However, reporting in the AI literature is highly unstandardized, and algorithmic fairness remains challenging to assess. In this preliminary study, we evaluate 59 studies on the development, validation, and trialing of AI tools for referable diabetic retinopathy (RDR) diagnosis against measures of transparency. To do so, we employ a scoring system based on an AI model card, a framework for benchmarked assessment of algorithmic fairness.
Methods:
We identified 59 studies on AI algorithms for RDR diagnosis from fundus photographs: 17 reported on algorithm training and internal validation, 26 on external validation, and 16 on prospective clinical validation of RDR algorithms. We applied our model card scoring system to these studies to broadly assess algorithm transparency. Scored model card elements include basic model details (e.g., model version), elements of intended use, input/output definitions and architecture, training and evaluation dataset details (e.g., source, size, demographics), performance measures (AUC, sensitivity [SE], and specificity [SP]), and ethical factors relating to algorithm bias.
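To illustrate the scoring approach, the sketch below encodes a checklist of model card elements of the kind listed above and awards one point per element reported by a study. The element names, the one-point-per-element weighting, and the score_model_card helper are illustrative assumptions, not the actual instrument; the full instrument scores 22 elements (see Results).

```python
# Minimal sketch of a model-card transparency checklist (illustrative only).
# Element names and one-point-per-element weighting are assumptions;
# the actual instrument comprises 22 scored elements.

MODEL_CARD_ELEMENTS = [
    # Basic model details
    "model_version",
    # Intended use
    "scope_of_use",
    # Input/output definitions and architecture
    "input_definition",       # e.g., fundus photograph specifications
    "output_definition",      # e.g., referable DR threshold
    "architecture",
    # Training and evaluation dataset details
    "dataset_source",
    "dataset_size",
    "dataset_demographics",   # e.g., age, sex, race
    # Performance measures
    "auc",
    "sensitivity",
    "specificity",
    # Ethical factors
    "bias_assessment",
]

def score_model_card(reported: set[str]) -> int:
    """Return one point for each checklist element the study reports."""
    return sum(1 for element in MODEL_CARD_ELEMENTS if element in reported)

# Example: a study reporting SE/SP and dataset basics, but not AUC or race data.
example_study = {"model_version", "input_definition", "output_definition",
                 "dataset_source", "dataset_size", "sensitivity", "specificity"}
print(score_model_card(example_study), "of", len(MODEL_CARD_ELEMENTS))
```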
Results:
Out of a total possible score of 22, clinical validation studies scored an average of 16.7 (range 13-20), representing a moderate level of transparency. Only 1/16 clinical validation studies defined a clear scope of use, and only 3/16 reported data on race. While nearly all reported sensitivity and specificity, only 9/16 reported AUC and only 4/16 reported imageability. Clinical validation studies enrolled an average of 1094 patients (range 143-4381). Average AUC, SE, and SP were 0.9305, 90.8%, and 85.8%, respectively. Reporting on training and external validation studies likewise varied widely; only 6/43 reported race data. Training datasets ranged from 89 to 466,247 images, averaging 52,035 images. For training and internal validation studies, average AUC, SE, and SP were 0.960, 90.9%, and 89.46%, respectively; for externally validated algorithms, they were 0.942, 92.4%, and 86.17%.
Conclusions:
Our results demonstrate high variability in the reporting of AI algorithms for RDR, with many clinical validation studies showing only moderate or poor transparency. Model cards may help promote fairness and standardize AI reporting.
This abstract was presented at the 2022 ARVO Annual Meeting, held in Denver, CO, May 1-4, 2022, and virtually.