Abstract
Purpose :
Accurate evaluation of the optic disc is a key part of glaucoma assessment. The Glaucomatous Optic Neuropathy Evaluation (GONE) project aimed to train graders to make accurate severity assessments that align with those of an expert panel. In this project we evaluate the ability of machine learning systems to grade reference discs and emulate experts. The fitted models can also give insight into the decision-making process used by experts.
Methods :
The GONE dataset includes 42 monoscopic optic disc images ranging from healthy to severe grades of glaucoma. Nine glaucoma-predicting features were extracted, including disc size (DS), disc shape (DSh), disc tilt (DT), peripapillary atrophy (PPA), cup-to-disc ratio (CDR), cup depth (CD), hemorrhage (Ha) and nerve fibre layer (NFL) loss. Each disc was graded from 1 to 4 by 197 glaucoma graders (37 glaucoma subspecialists, 51 comprehensive ophthalmologists and 109 ophthalmology trainees) from 22 countries through the GONE project between 2008 and 2010. From the original dataset, we centred each feature in the input matrix to zero mean and, to mitigate the small sample size, created a new dataset of 500 observations by bootstrap resampling (random sampling with replacement). The new dataset was partitioned into training and test subsets. Partial Least Squares Regression (PLSR), Multivariate Adaptive Regression Splines (MARS), Random Forest (RF) and linear models (LM) were fitted to the data using R. Root Mean Squared Error (RMSE) was used to measure the agreement between the predictions and the actual data.
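The preprocessing steps described above (centring, bootstrap resampling, partitioning) can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code. The feature values, the per-disc target, the random seed and the 350/150 split are all assumptions, since the abstract does not report them.

```python
import random


def centre_columns(rows):
    """Subtract each column's mean so that every feature has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def bootstrap(rows, targets, n_samples, rng):
    """Draw n_samples (row, target) pairs uniformly at random, with replacement."""
    idx = [rng.randrange(len(rows)) for _ in range(n_samples)]
    return [rows[i] for i in idx], [targets[i] for i in idx]


def split(rows, targets, n_train):
    """Partition by position; the bootstrap draw is already in random order."""
    return (rows[:n_train], targets[:n_train]), (rows[n_train:], targets[n_train:])


rng = random.Random(0)  # fixed seed (assumption) so the sketch is repeatable

# Hypothetical stand-in data: 42 discs x 9 feature scores, one grade per disc.
features = [[rng.uniform(0.0, 3.0) for _ in range(9)] for _ in range(42)]
grades = [rng.uniform(1.0, 4.0) for _ in range(42)]

centred = centre_columns(features)
X, y = bootstrap(centred, grades, 500, rng)
# The 350/150 split is an assumption; the abstract does not give the ratio.
(train_X, train_y), (test_X, test_y) = split(X, y, 350)
```

Because sampling is with replacement, the 500 observations are repeats of the original 42 discs, so the same disc can appear in both subsets.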
Results :
The LM (RMSE=0.28, R²=0.77) and PLSR (RMSE=0.34, R²=0.754) models did not fit the data well. Very good fits were obtained with RF (RMSE=0.01, R²=0.996) and MARS (RMSE=0.08, R²=0.933). Because MARS is quasi-linear, inferences could be drawn from it. The MARS model used seven features: CS, NFL loss, CDR, DS, CD, PPA and DT. DSh and Ha were unused by the model.
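For reference, the two fit statistics reported above can be computed as in the sketch below. It uses the standard definitions of RMSE and the coefficient of determination. The sample grades and predictions are hypothetical and are not taken from the GONE data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical grades and model predictions, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
```

A lower RMSE and a coefficient of determination closer to 1 both indicate closer agreement between predictions and data.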
Conclusions :
Non-linear machine learning models can match the accuracy of experts in automated optic disc evaluation for glaucoma risk. The sample variance was almost fully explained by the input feature matrix, suggesting that experts were not grading risk using other features (e.g. colour). The MARS model is simple to understand and implement and thus holds promise for use in automated diagnosis systems. The fact that Ha and DSh were not used by the model suggests that the variance from these features was already captured by the other features.
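To illustrate why a MARS model is straightforward to implement: a fitted MARS model is an intercept plus a weighted sum of hinge functions, max(0, x - t) or max(0, t - x), so prediction needs only a few lines. The sketch below is hypothetical throughout. The intercept, knots, coefficients and feature values are invented for illustration and are not the fitted model from this study.

```python
def hinge(x, knot, direction=1):
    """MARS basis function.

    direction=+1 gives max(0, x - knot); direction=-1 gives max(0, knot - x).
    """
    return max(0.0, direction * (x - knot))


# Hypothetical model: (feature, knot, direction, coefficient) terms.
INTERCEPT = 2.0
TERMS = [
    ("CDR", 0.5, +1, 3.0),
    ("CDR", 0.5, -1, -1.0),
    ("PPA", 1.0, +1, 0.4),
]


def predict(features):
    """Evaluate the piecewise-linear model for one disc's feature values."""
    return INTERCEPT + sum(
        coef * hinge(features[name], knot, direction)
        for name, knot, direction, coef in TERMS
    )


# Hypothetical feature values for a single disc.
example = {"CDR": 0.7, "PPA": 2.0}
grade = predict(example)
```

Each term is active on only one side of its knot, so the contribution of every feature can be read directly from its knots and coefficients.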
This is an abstract that was submitted for the 2016 ARVO Annual Meeting, held in Seattle, Wash., May 1-5, 2016.