Xueyang Wang, Lucy I Mudie, Baskaran Mani, Ching-Yu Cheng, David S Friedman, Christopher J. Brady; Crowdsourcing to evaluate fundus photos for the presence of glaucoma. Invest. Ophthalmol. Vis. Sci. 2016;57(12):.
© ARVO (1962-2015); The Authors (2016-present)
Glaucoma is the second leading cause of blindness worldwide, and screening efforts are resource intensive. Crowdsourcing engages individuals in the community to complete tasks collectively. We assessed the accuracy of crowdsourcing for grading optic nerve images for glaucoma using Amazon Mechanical Turk (AMT) before and after providing a training module.
Images (n=60) from three large population studies were graded for glaucoma status (Definite or Probable, Possible, No Glaucoma). In the baseline trial, users on AMT (Turkers) graded fundus photos for glaucoma after consulting sample images. In a follow-up trial, Turkers were given a 26-slide training module and had to pass a quiz before being able to grade the same 60 images. The quiz was scored out of 12; 2 points were given for exact accuracy, 1 point for accuracy within 1 glaucoma grade, and 0 if incorrect. The passing quiz score was 6/12. Each image was graded by 10 unique Turkers in all trials for $0.10 per image. The mode of Turker grades for each image was compared to an adjudicated expert grade to determine agreement as well as the sensitivity and specificity of Turker grading. Spearman’s rank correlation coefficient was calculated for the association between Turkers’ quiz score and proportion of images graded correctly.
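The quiz scoring rule and the per-image mode described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the ordinal coding of the three grades, the function names, the reading of "passing" as a score of at least 6, and the tie-breaking rule for the mode are all assumptions that the abstract does not specify.

```python
from collections import Counter

# Assumed ordinal coding of the three grades (not stated in the abstract).
GRADES = {"No Glaucoma": 0, "Possible": 1, "Definite or Probable": 2}


def quiz_score(answers, key):
    """Score a quiz: 2 points for an exact match, 1 point for an answer
    within one grade of the key, 0 otherwise."""
    total = 0
    for given, correct in zip(answers, key):
        diff = abs(GRADES[given] - GRADES[correct])
        if diff == 0:
            total += 2
        elif diff == 1:
            total += 1
    return total


def passed(score, threshold=6):
    """Assumes 'passing score 6/12' means a score of at least 6."""
    return score >= threshold


def modal_grade(grades):
    """Return the most frequent grade among the graders of one image.
    Ties go to the more severe grade; this tie rule is an assumption."""
    counts = Counter(grades)
    top = max(counts.values())
    tied = [grade for grade, count in counts.items() if count == top]
    return max(tied, key=GRADES.get)


if __name__ == "__main__":
    # Hypothetical six-question quiz (12 points at 2 points per question).
    key = ["No Glaucoma", "Possible", "Definite or Probable"] * 2
    answers = ["No Glaucoma", "No Glaucoma", "No Glaucoma",
               "Definite or Probable", "Possible", "Possible"]
    score = quiz_score(answers, key)
    print(score, passed(score))

    # Hypothetical grades from 10 Turkers for a single image.
    votes = ["Possible"] * 4 + ["No Glaucoma"] * 3 + ["Definite or Probable"] * 3
    print(modal_grade(votes))
```

The modal grade for each image would then be compared with the adjudicated expert grade to obtain the agreement figures reported in the results.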
In the baseline study, 19 images (32%) were graded correctly, with κ=-0.0327 indicating poor agreement with expert grading. The area under the curve (AUC) of the receiver operating characteristic (ROC) curve was low, 0.51 (95% CI: 0.36-0.65). Most Turkers took less than 25 minutes to complete the training module and pass the quiz. Post-training, 19 Turkers performed 600 gradings within 28 hours for less than $130; 33 images (55%) were graded correctly, with κ=0.3333, and AUC of 0.76 (95% CI: 0.64-0.87). There was poor association between Turkers’ quiz scores and proportion of images graded accurately; Spearman’s coefficient was 0.27 (p=0.28) after the removal of 1 outlier.
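For readers unfamiliar with the κ statistic, the following is a minimal sketch of unweighted Cohen's kappa between two sets of grades, for example the modal Turker grade and the expert grade per image. The abstract does not say whether a weighted or unweighted kappa was used, so the unweighted form here is an assumption, and the function name and sample data are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so report perfect agreement by convention (an assumption).
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels, not data from the study.
    expert = ["No", "No", "Possible", "Definite"]
    crowd = ["No", "Possible", "Possible", "Definite"]
    print(round(cohens_kappa(expert, crowd), 4))
```

A value near 0 means agreement no better than chance and 1 means perfect agreement, which is the scale on which the baseline and post-training κ values above are reported.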
Turkers graded 60 fundus images quickly and at low cost both before and after training. Grading accuracy, sensitivity and specificity all improved after Turkers completed the training module and passed the quiz. With effective training, crowdsourcing may be an efficient tool to screen retinal images for glaucoma, which may lead to earlier detection of disease for individuals at risk.
This is an abstract that was submitted for the 2016 ARVO Annual Meeting, held in Seattle, Wash., May 1-5, 2016.
Association between proportion of images graded correctly and quiz score.