Christopher J Brady, Fahd Naufal, Meraf A Wolle, Harran Mkocha, Sheila K West; Crowdsourcing Can Match Field Grading Validity for Follicular Trachoma. Invest. Ophthalmol. Vis. Sci. 2021;62(8):1788.
As trachoma is eliminated, field graders lose exposure to the disease and become less adept at identifying follicular trachoma (TF). New approaches to completing field surveys, including photography and telemedicine, may be needed to confirm elimination and accurately monitor for re-emergence. Expert grading of images is costly and time-intensive. Our purpose was to validate crowdsourcing for follicular trachoma image interpretation.
Tarsal plate images acquired using a smartphone-based device during a 2019 field survey in Tanzania (n=1000) were posted to the Amazon Mechanical Turk (AMT) crowdsourcing marketplace for grading as "not-TF," "possible TF," "probable TF," or "definite TF." Each image was graded by 7 unique graders, who received US$0.05 per image. The grades were summed to create a raw score (0-21), which was analyzed by receiver-operating characteristic (ROC) analysis, using images with concordant field and expert photo grades, to determine the optimal diagnostic set point. Kappa, sensitivity, and specificity were then analyzed at various prevalences of disease.
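The scoring scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the study's code: it assumes the four categorical grades map to 0-3 (so seven graders yield a 0-21 raw score) and that an image is called TF-positive when the raw score meets or exceeds the cutoff (the abstract reports an optimal set point of 7; whether the comparison is inclusive is an assumption here).

```python
# Hypothetical sketch of the crowdsourced raw-score scheme.
# Assumed mapping of the four AMT grade labels to ordinal values 0-3;
# seven graders per image give a raw score in the range 0-21.
GRADE_VALUES = {"not-TF": 0, "possible TF": 1, "probable TF": 2, "definite TF": 3}

def raw_score(grades):
    """Sum the 7 worker grades for one image into a 0-21 raw score."""
    return sum(GRADE_VALUES[g] for g in grades)

def classify_tf(grades, cutoff=7):
    """Call an image TF-positive if the raw score meets the cutoff (>= assumed)."""
    return raw_score(grades) >= cutoff

# Example with made-up grades from 7 workers:
example = ["definite TF", "probable TF", "possible TF", "not-TF",
           "possible TF", "probable TF", "not-TF"]
print(raw_score(example), classify_tf(example))  # raw score 9 -> TF-positive
```

In practice the cutoff would be chosen by sweeping all 22 possible raw-score thresholds and picking the point on the ROC curve that best balances sensitivity and specificity, as the abstract describes.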
In total, 7,000 grades were rendered in 1 hour for US$420. The raw score produced an area under the ROC curve of 0.940 (95% CI 0.902-0.977). Optimizing the set point to a raw score of 7 produced a kappa of 0.43, sensitivity of 84.8%, specificity of 90%, and percent correct (relative to the master/field grade) of 89.3% in the full sample, with a TF prevalence of 5.7%. When normal images were randomly removed from the sample to mimic the prevalences used to validate field graders (30% and 75% TF), kappa ranged from 0.71 to 0.74, which is within the acceptable range per the World Health Organization. Images with discordant field and expert grades were more likely to receive a raw score in the middle of the range, suggesting disagreement among crowdsourcers as well.
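The agreement statistics reported above (sensitivity, specificity, Cohen's kappa, percent correct) all derive from a 2x2 confusion matrix of crowdsourced calls against the reference grade. A minimal sketch, using illustrative counts chosen only to approximate the reported full-sample figures (they are not the study's actual data):

```python
# Standard 2x2 agreement statistics. The counts below are illustrative:
# ~1000 images at ~5.7% TF prevalence, roughly reproducing the reported
# sensitivity (~85%), specificity (~90%), and kappa (~0.43).
def diagnostics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, Cohen's kappa) for a 2x2 table."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    p_observed = (tp + tn) / n  # percent correct vs. the reference grade
    # Chance agreement: product of each rater's marginal positive/negative rates.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return sensitivity, specificity, kappa

sens, spec, kappa = diagnostics(tp=48, fp=94, fn=9, tn=849)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} kappa={kappa:.2f}")
```

The sketch makes the abstract's central point concrete: at low prevalence, high percent agreement coexists with a modest kappa, because chance agreement on the abundant negative class is large; enriching the sample toward 30-75% TF raises kappa without changing the grader's behavior.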
Crowdsourcing was able to rapidly and accurately identify TF on smartphone-acquired photographs with minimal training. Agreement with the reference standard was poor in a sample with low TF prevalence, but when held to the same standard as a skilled field grader under the current training paradigm, crowdsourcing may be acceptable. Further testing against field grading in low-prevalence areas is needed.
This is a 2021 ARVO Annual Meeting abstract.
Figure: Receiver-operating characteristic for raw crowdsourcing score for images with concordant field and expert photograph grades.
Figure: Distribution of field and expert grades within each crowdsourced raw score.