The performance of the machine learning algorithm and the two human observers was compared separately to the reference standard by using Cohen's κ agreement and receiver operating characteristic (ROC) analysis. Bootstrap analysis42 was performed to obtain the mean ROC curve and the 95% confidence intervals.
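For illustration, a minimal sketch of such a bootstrap procedure is given below; it assumes per-case binary labels and continuous algorithm scores (hypothetical arrays y_true and y_score) and averages the resampled ROC curves on a common false-positive-rate grid to obtain the mean curve and percentile-based 95% confidence intervals.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def bootstrap_roc(y_true, y_score, n_boot=1000, seed=0):
    """Mean ROC curve and percentile 95% CIs via bootstrap resampling (illustrative sketch)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    fpr_grid = np.linspace(0.0, 1.0, 101)                # common FPR grid for averaging curves
    tprs, aucs = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:              # skip samples containing a single class
            continue
        fpr, tpr, _ = roc_curve(y_true[idx], y_score[idx])
        tprs.append(np.interp(fpr_grid, fpr, tpr))       # interpolate onto the common grid
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    mean_tpr = np.mean(tprs, axis=0)
    tpr_lo, tpr_hi = np.percentile(tprs, [2.5, 97.5], axis=0)
    auc_lo, auc_hi = np.percentile(aucs, [2.5, 97.5])
    return fpr_grid, mean_tpr, (tpr_lo, tpr_hi), (float(np.mean(aucs)), auc_lo, auc_hi)
```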
We performed two different experiments to evaluate the performance of the algorithm in different scenarios: (1) AMD grading into five severity stages, as shown in Table 1, in the test set; and (2) AMD high-risk level identification in the test set and the external set. Instead of a detailed AMD staging, in experiment 2 the algorithm was retrained for the binary task of identifying patients at high risk of progression to AMD by grouping severity grades 1 to 2 into low risk and grades 3 to 5 into high risk. This experiment allowed comparison with previous work29 and assessment of the generalizability of the algorithm to data from a different source. The area (Az) under the ROC curve and sensitivity/specificity values were used as the performance measures for experiment 2. For experiment 1, the overall agreement of the algorithm output and of the observers' gradings with the reference standard was calculated by using κ statistics (SPSS, v20.0.0; IBM Corp., Armonk, NY, USA).
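The summary measures for both experiments can be reproduced with standard library calls, as in the following sketch; the function names, the decision threshold, and the grade regrouping helper are illustrative assumptions, with cohen_kappa_score standing in for the SPSS κ computation.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix, roc_auc_score

def experiment1_agreement(y_reference, y_grading):
    """Cohen's kappa between a grading (algorithm or observer) and the reference standard."""
    return cohen_kappa_score(y_reference, y_grading)

def experiment2_metrics(y_reference, y_score, threshold=0.5):
    """Az, sensitivity, and specificity for the binary high-risk task (illustrative threshold)."""
    az = roc_auc_score(y_reference, y_score)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_reference, y_pred).ravel()
    return az, tp / (tp + fn), tn / (tn + fp)

def to_high_risk(severity_grade):
    """Regroup the five severity grades: 1-2 -> low risk (0), 3-5 -> high risk (1)."""
    return int(severity_grade >= 3)
```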
The parameters of the algorithm, namely, the number M of patches per OCT volume, the patch size n, the number p of principal components, and the number k of visual words per AMD stage, were optimized by using one-eighth of the training set.
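How these parameters enter the model can be illustrated with a rough bag-of-visual-words sketch; the random sampling strategy, the per-stage clustering, and all function names below are interpretive assumptions rather than the exact implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def sample_patches(volume, M=10_000, n=61, seed=0):
    """Draw M random n-by-n patches from the B-scans of one OCT volume (assumed 3-D array)."""
    rng = np.random.default_rng(seed)
    n_bscans, height, width = volume.shape
    patches = np.empty((M, n * n), dtype=np.float32)
    for i in range(M):
        b = rng.integers(0, n_bscans)
        r = rng.integers(0, height - n + 1)
        c = rng.integers(0, width - n + 1)
        patches[i] = volume[b, r:r + n, c:c + n].ravel()
    return patches

def build_visual_words(patches_per_stage, p=100, k=2500):
    """Reduce patches to p principal components, then learn k visual words per AMD stage."""
    pca = PCA(n_components=p).fit(np.vstack(patches_per_stage))
    dictionaries = [KMeans(n_clusters=k, n_init=10).fit(pca.transform(x)).cluster_centers_
                    for x in patches_per_stage]
    return pca, dictionaries
```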
The parameter M must be set high enough to accurately capture the characteristics of the OCT volume; it was set to 10,000 patches, as a higher number of patches had no effect on the performance. A grid search was performed over the remaining three parameters: n was varied between 11 and 61 pixels, k between 50 and 2500 visual words, and p from 10 to 150 components. The optimal values were identified as n = 61 pixels, k = 2500 visual words, and p = 100 components.
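The grid search itself could be organized as in the sketch below, with M fixed at 10,000; the intermediate grid values and the evaluate callable (which would train on the tuning subset and return a validation score) are assumptions, not the reported protocol.

```python
import itertools

def grid_search_parameters(evaluate, M=10_000):
    """Exhaustive search over (n, k, p) with M fixed; `evaluate` is a user-supplied callable
    returning a validation score on the held-out eighth of the training set (hypothetical)."""
    patch_sizes  = [11, 21, 31, 41, 51, 61]    # n, in pixels (step size assumed)
    visual_words = [50, 250, 500, 1000, 2500]  # k, visual words per AMD stage (assumed grid)
    components   = [10, 50, 100, 150]          # p, principal components (assumed grid)
    best = max(itertools.product(patch_sizes, visual_words, components),
               key=lambda nkp: evaluate(M=M, n=nkp[0], k=nkp[1], p=nkp[2]))
    return best  # reported optimum: n = 61, k = 2500, p = 100
```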