Abstract
Purpose:
Understanding the causes of disagreement among experts in clinical decision making has been a challenge for decades. In particular, diagnosis of retinopathy of prematurity (ROP) exhibits a high degree of inter-expert variability. Computer-based image analysis is one approach to reducing this variability. However, a critical unanswered question is how the sets of retinal vascular features considered during diagnosis differ between experts. We propose a methodology that uses machine learning to understand the underlying causes of inter-expert variability.
Methods:
A set of 34 retinal images was diagnosed by 22 independent experts. Feature selection (FS) was applied to identify the most important features considered by each expert, and the resulting feature sets were compared across experts using similarity measures. Finally, an automated classification system was built from the most relevant features to assess whether this approach can assist ROP diagnosis.
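To make the pipeline concrete, the following is a minimal sketch, assuming mutual-information-based FS, the Jaccard index as the similarity measure, and a random-forest classifier; these choices, the synthetic data, and all variable names are illustrative assumptions, not the paper's actual methods.

```python
# Sketch of the described pipeline: per-expert feature selection,
# pairwise similarity of the selected feature sets, and a classifier
# trained on the most frequently selected features.
# All data here are synthetic placeholders.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_images, n_features, n_experts = 34, 10, 22
X = rng.normal(size=(n_images, n_features))               # vascular features per image
labels = rng.integers(0, 3, size=(n_experts, n_images))   # 0=neither, 1=pre-plus, 2=plus

# 1) Feature selection per expert: which features best explain
#    each expert's own diagnoses?
k = 4
selected = []
for y in labels:
    fs = SelectKBest(mutual_info_classif, k=k).fit(X, y)
    selected.append(set(np.flatnonzero(fs.get_support())))

# 2) Pairwise similarity of the selected feature sets (Jaccard index).
def jaccard(a, b):
    return len(a & b) / len(a | b)

sim = np.array([[jaccard(selected[i], selected[j])
                 for j in range(n_experts)] for i in range(n_experts)])

# 3) Classifier on the features most often selected across experts,
#    evaluated against a majority-vote consensus label.
counts = np.bincount([f for s in selected for f in s], minlength=n_features)
top = np.argsort(counts)[-k:]
y_ref = np.array([np.bincount(col).argmax() for col in labels.T])  # majority vote
cv = KFold(n_splits=3, shuffle=True, random_state=0)
acc = cross_val_score(RandomForestClassifier(random_state=0),
                      X[:, top], y_ref, cv=cv).mean()
print(f"mean pairwise Jaccard: {sim[np.triu_indices(n_experts, 1)].mean():.2f}")
print(f"cross-validated accuracy on consensus features: {acc:.2f}")
```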
Results:
The experimental results reveal that the top features selected, regardless of the expert considered, are: mean venous and arterial tortuosity (for 100% and 47% of experts, respectively), mean venous acceleration (42% of experts), and maximum main branch leaf node factor in arteries (68% of experts). For pairs of experts with a high percentage of inter-agreement, the FS methods also select similar features. These findings suggest that, besides the standard features (arterial tortuosity and venous dilation), experts may be considering other features, and that this may be a source of disagreement. Finally, we built an automatic system using the relevant selected features, which improved classification accuracy from 68% to 80% when distinguishing among plus, pre-plus, and neither, and maintained 88% accuracy when classifying plus versus not plus. The Williams' indices obtained by our system (greater than 1) reinforce the idea that its behavior is similar to that of expert clinicians.
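For reference, the Williams' index compares an automated rater's agreement with the experts to the experts' agreement among themselves; one common form (assumed here, as the abstract does not give the formula) is:

```latex
% Williams' index for the automated system (rater 0) against n experts.
% a_{i,j} denotes the agreement (e.g., the proportion of images given
% the same diagnosis) between raters i and j. This is one common form
% of the index; the paper's exact definition is assumed, not quoted.
\[
  I_0 = \frac{\dfrac{1}{n}\sum_{j=1}^{n} a_{0,j}}
             {\dfrac{2}{n(n-1)}\sum_{j=1}^{n}\sum_{k=j+1}^{n} a_{j,k}}
\]
% I_0 > 1 indicates that the system agrees with the experts at least
% as well as the experts agree with one another.
```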
Conclusions:
We provide a practical framework to identify the features that matter to each expert and to check whether the selected features reflect pairwise disagreements. These findings may lead to improved ROP diagnostic accuracy and standardization among clinicians, and may generalize to other clinical problems.