Christopher J. Brady, Lucy Mudie, David S Friedman; Rasch modelling improves consensus scoring of crowdsourced data. Invest. Ophthalmol. Vis. Sci. 2017;58(8):4289.
© ARVO (1962-2015); The Authors (2016-present)
Screening for diabetic retinopathy (DR) is cost-effective but underutilized, and novel methods are needed to simplify implementation of this sight-saving activity. We compared diagnostic DR grades of retinal fundus photographs provided by users of the Amazon Mechanical Turk (AMT) crowdsourcing marketplace with gold-standard grading, and explored whether determination of the consensus of crowdsourced classifications could be improved beyond a simple majority vote (MV) using regression methods.
One thousand two hundred retinal images of individuals with diabetes mellitus from the Messidor public dataset were posted to AMT. Ten workers classified each image as normal or abnormal. If half or more of the workers judged the image to be abnormal, the MV “consensus” grade was designated as abnormal. Logistic regression was used to determine whether a more accurate “consensus” could be devised. Finally, Rasch analysis was used to calculate worker ability scores in a random 50% training set, which were then used as weights in a regression model in the remaining 50% test set. Outcomes of interest were the percentage of correctly classified images, sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) for the consensus grade as compared with the expert grading provided with the dataset.
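The majority-vote rule and the outcome measures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names `majority_vote` and `diagnostic_summary` are hypothetical, votes are coded 1 for abnormal and 0 for normal, and the four toy images are invented purely to show the arithmetic.

```python
from typing import Dict, List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return 1 (abnormal) if half or more of the votes are 1, else 0 (normal).

    A tie counts as abnormal, matching the "half or more" rule in the text.
    """
    return 1 if 2 * sum(votes) >= len(votes) else 0


def diagnostic_summary(predicted: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Compare consensus grades with reference grades (1 = abnormal)."""
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy data (assumed, not from the study): 4 images, 10 worker votes each.
worker_votes: List[List[int]] = [
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],  # 6 of 10 abnormal
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # 5 of 10: a tie, graded abnormal
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # 1 of 10 abnormal
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # 4 of 10 abnormal
]
expert = [1, 0, 0, 1]  # hypothetical reference grades

consensus = [majority_vote(v) for v in worker_votes]
summary = diagnostic_summary(consensus, expert)
print(consensus, summary)
```

Because a 5-5 split is graded abnormal, the rule leans toward sensitivity on tied images, which is one reason a weighted or regression-based consensus can behave differently from the simple vote.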
Using MV grading, the consensus was correct in 75.5% of images, with a sensitivity and specificity of 75.5%, and an AUROC of 0.75 (95% confidence interval [CI] 0.73-0.78). Using a logistic regression model with Rasch-weighted individual scores, 77.7% of images were graded correctly, with a specificity of 68.7% and an AUROC of 0.80 (0.76-0.83), using a diagnostic cut-point setting sensitivity at 90%. Across all diagnostic cut-points, the AUROC using the weighted scores increased to 0.91 (95% CI 0.88-0.93) from 0.89 (95% CI 0.86-0.92) for a model using unweighted scores (Fig. 1, chi-square p-value < 0.001).
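Fixing sensitivity at a target value and reading off the resulting specificity can be illustrated as follows. This is a sketch under assumptions rather than the procedure used in the study: `cut_point_for_sensitivity` is a hypothetical helper, the scores stand in for predicted probabilities from any consensus model, and the ten toy values are invented for illustration.

```python
from typing import Sequence, Tuple


def cut_point_for_sensitivity(
    scores: Sequence[float], truth: Sequence[int], target: float = 0.90
) -> Tuple[float, float, float]:
    """Pick the highest threshold whose sensitivity is at least `target`.

    An image is called abnormal when its score is >= the threshold.
    Returns (threshold, sensitivity, specificity).
    """
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one abnormal and one normal image")
    # Lowering the threshold can only raise sensitivity, so scan from the top.
    for threshold in sorted(set(scores), reverse=True):
        calls = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for c, t in zip(calls, truth) if c == 1 and t == 1)
        tn = sum(1 for c, t in zip(calls, truth) if c == 0 and t == 0)
        sensitivity = tp / n_pos
        if sensitivity >= target:
            return threshold, sensitivity, tn / n_neg
    raise ValueError("no threshold reaches the target sensitivity")


# Toy data (assumed): model scores and reference grades for 10 images.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]
truth = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

threshold, sensitivity, specificity = cut_point_for_sensitivity(scores, truth)
print(threshold, sensitivity, specificity)
```

With only five abnormal images in the toy data, the achievable sensitivities move in steps of 20%, so the first cut-point meeting the 90% target lands at 100%. On a larger test set the achieved sensitivity would sit much closer to the target.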
Crowdsourced interpretation of retinal images provides rapid and accurate results as compared with a gold-standard grading. Creating a logistic regression model using Rasch analysis to weight crowdsourced classifications by worker ability improves accuracy of aggregated grades as compared with simple majority vote, and allows for tuning of the test to optimize the diagnostic characteristic of most relevance.
This is an abstract that was submitted for the 2017 ARVO Annual Meeting, held in Baltimore, MD, May 7-11, 2017.