Abstract
Purpose:
To refine and externally validate a novel crowdsourcing method for fundus photograph grading.
Methods:
A crowdsourcing interface for fundus photograph classification, including annotated training images, was developed for Amazon Mechanical Turk (AMT) and refined based on user feedback. In Phase 1, nineteen expert-graded images were posted for categorization into 4 severity categories by AMT workers (Turkers), with 10 repetitions per photo. Three sequential batches were posted, with iterative refinements to the interface between batches. In Phase 2, 400 images from the MESSIDOR public dataset of non-mydriatic fundus photographs were posted using the refined interface from Phase 1, and Turkers were asked to categorize each image as normal or abnormal. In Phase 3, iterative improvements were made to the interface in an attempt to further improve accuracy on the MESSIDOR dataset. The main outcome measure was the proportion of images whose consensus Turker score matched the expert/gold-standard score.
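The abstract does not include the scoring procedure itself; the following Python is a minimal illustrative sketch of the main outcome measure, assuming "consensus" means the modal grade across the 10 Turker repetitions per image. The function names, example data, and tie-breaking rule are hypothetical, not taken from the paper.

from collections import Counter

def consensus_grade(grades):
    """Modal grade across Turker responses for one image; ties broken
    toward the lower severity grade (this tie rule is an assumption,
    not specified in the paper)."""
    counts = Counter(grades)
    top = max(counts.values())
    return min(g for g, c in counts.items() if c == top)

def consensus_accuracy(turker_grades, expert_grades):
    """Proportion of images whose consensus Turker score matches the
    expert/gold-standard score (the main outcome measure)."""
    hits = sum(consensus_grade(g) == e
               for g, e in zip(turker_grades, expert_grades))
    return hits / len(expert_grades)

# Hypothetical data: 3 images x 10 Turker grades on a 0-3 severity scale.
turker = [[0, 0, 0, 1, 0, 0, 2, 0, 0, 0],
          [2, 3, 2, 2, 1, 2, 2, 3, 2, 2],
          [1, 0, 1, 1, 2, 1, 1, 1, 0, 1]]
expert = [0, 2, 1]
print(consensus_accuracy(turker, expert))  # 1.0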
Results:
Across 190 grading instances in Phase 1, Turker consensus accuracy in 4-category grading increased from a baseline of 26.3% to a maximum of 52.6%. Turker accuracy at categorizing images as normal vs. abnormal increased from a baseline of 89.5% to 100%. Sensitivity of 100% for normal vs. abnormal was maintained throughout; maximum specificity was 85.7%. Across 4000 grading instances in Phase 2, Turkers had an overall accuracy of 68.5%. Excluding the first two MESSIDOR disease categories, level 1 (<5 microaneurysms (MA)) and level 2 (<15 MA or <5 hemorrhages), accuracy increased to 80.9%, with a sensitivity of 92.4% and a specificity of 78.0%. Four of 53 cases (7.5%) of level 3 (≥15 MA, ≥5 hemorrhages, or neovascularization) retinopathy were missed.
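As a consistency check (not stated in the original), the reported sensitivity follows directly from the level 3 miss count, assuming that once levels 1 and 2 are excluded the 53 level 3 cases are the only diseased images in the analysis:

\[
\text{sensitivity} = \frac{53 - 4}{53} = \frac{49}{53} \approx 92.4\%
\]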
Conclusions:
With minimal training, the AMT workforce can rapidly and correctly categorize fundus photographs of diabetic patients as normal or abnormal when moderate to severe disease is present. Further refinement is required for Turkers to identify subtle disease and to correctly categorize the level of disease. That Turker accuracy was preserved on a dataset different from the one used to develop the interface is a critical validation. Images were interpreted for a total cost of $1.10 per eye. Crowdsourcing may offer a novel and inexpensive means to reduce the skilled-grader burden and increase screening for diabetic retinopathy in some settings.