Purpose. To determine which machine learning classifier learns best to interpret standard automated perimetry (SAP) and to compare the best of the machine classifiers with the global indices of STATPAC 2 and with experts in glaucoma.

Methods. Multilayer perceptrons (MLP), support vector machines (SVM), mixture of Gaussian (MoG), and mixture of generalized Gaussian (MGG) classifiers were trained and tested by cross validation on the numerical plot of absolute sensitivity plus age of 189 normal eyes and 156 glaucomatous eyes, designated as such by the appearance of the optic nerve. The authors compared performance of these classifiers with the global indices of STATPAC, using the area under the ROC curve. Two human experts were judged against the machine classifiers and the global indices by plotting their sensitivity–specificity pairs.

Results. MoG had the greatest area under the ROC curve of the machine classifiers. Pattern SD (PSD) and corrected PSD (CPSD) had the largest areas under the curve of the global indices. MoG had significantly greater ROC area than PSD and CPSD. Human experts were not better at classifying visual fields than the machine classifiers or the global indices.

Conclusions. MoG, using the entire visual field and age for input, interpreted SAP better than the global indices of STATPAC. Machine classifiers may augment the global indices of STATPAC.

^{ 1 }

^{ 2 }

^{ 3 }This fluctuation makes the identification of glaucoma and the detection of its progression difficult to establish.

^{ 4 }

^{ 5 }

^{ 6 }detect visual field progression,^{ 7 } assess structural data from the optic nerve head,^{ 8 } and identify noise from visual field information.^{ 9 } Neural networks have improved the ability of clinicians to predict the outcome of patients in intensive care, diagnose myocardial infarctions, and estimate the prognosis of surgery for colorectal cancer.^{ 10 }^{ 11 }^{ 12 }

We applied a broad range of popular or novel machine classifiers that represent different methods of learning and reasoning.

^{ 13 }angle abnormalities on gonioscopy, any diseases other than glaucoma that could affect the visual fields, and medications known to affect visual field sensitivity. Subjects with a best-corrected visual acuity worse than 20/40, spherical equivalent outside ±5.0 diopters, and cylinder correction >3.0 diopters were excluded. Poor quality stereoscopic photographs of the optic nerve head served as an exclusion for the glaucoma population. A family history of glaucoma was not an exclusion criterion.

*n* = 95) for whom photography was not available. These normal subjects had no evidence of optic disc damage on dilated slit-lamp indirect ophthalmoscopy with a hand-held 78-diopter lens.

^{ 14 }Global indices are statistical classifiers tailored to SAP: mean deviation (MD), pattern SD (PSD), short-term fluctuations (SF),^{ 15 } corrected pattern SD (CPSD), and glaucoma hemifield test (GHT).^{ 16 } The clinician uses these plots and indices to estimate the likelihood of glaucoma from the pattern of the visual field.

^{ 17 }The data are projected onto their principal components. The first principal component lies along the axis that shows the highest variance in the data. Each subsequent component lies along the direction of highest remaining variance that is orthogonal to the preceding ones, so that together they form an orthogonal set of basis functions. For the PCA basis, the covariance matrix of the data is computed, and the eigenvalues of the matrix are sorted in decreasing order.
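The projection described above can be sketched with NumPy. This is only an illustration under stated assumptions: the function name `pca_project`, the random toy matrix, and the choice of two retained components are hypothetical and are not taken from the study.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the leading principal components."""
    # Center the data so the covariance is taken about the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns of X).
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Reorder so that the eigenvalues (variances) are decreasing.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    basis = eigvecs[:, :n_components]
    return X_centered @ basis, eigvals, basis

# Hypothetical toy data: 200 samples, 5 features with unequal spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
scores, eigvals, basis = pca_project(X, n_components=2)
```

The variance of the first column of `scores` equals the largest eigenvalue, which is the sense in which the first component carries the highest variance.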

^{ 18 }Multilayer perceptrons (MLP), support vector machines (SVM), mixture of Gaussian (MoG), and mixture of generalized Gaussian (MGG) are effective machine classifiers with different methods of learning and reasoning. The following paragraphs describe the training of each classifier type. Readers who want detailed descriptions with references of the machine classifiers will find them in the Appendix.


*y* for a given input vector, **x**, was \(y(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{p} \alpha_{i} y_{i} K(\mathbf{x}, \mathbf{x}_{i}) + b\right)\), where *b* was the bias, and the coefficients \(\alpha_{i}\) were obtained by training the SVM. The SVMs were trained by implementing Platt’s sequential minimal optimization algorithm in MATLAB.^{ 34 }^{ 35 }

^{ 36 }The training of the SVM was achieved by finding the support vector components, \(\mathbf{x}_{i}\), and the associated weights, \(\alpha_{i}\). For the kernel function, \(K(\mathbf{x}, \mathbf{x}_{i})\), the linear kernel was \(\mathbf{x} \cdot \mathbf{x}_{i}\), and the Gaussian kernel was \(\exp\left(-0.5\,\lVert \mathbf{x} - \mathbf{x}_{i} \rVert^{2} / \sigma^{2}\right)\). The penalty used to avoid overfitting was *C* = 1.0 for both the linear and the Gaussian kernel. With the Gaussian kernel, the choice of σ depended on input dimension: \(\sigma \propto \sqrt{53}\) or \(\sigma \propto \sqrt{8}\). The output was constrained between 0 and 1 with a logistic regression. If the output value was on the positive side of the decision surface, it was considered glaucomatous; if it was on the negative side of the decision surface, it was considered nonglaucomatous. When generating the ROC curve, scalar output of the SVMs was extracted so that the decision threshold could be varied to obtain different sensitivity–specificity pairs for the ROC curve.
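As a rough illustration of the decision rule and the two kernels described above, the following sketch evaluates the scalar SVM output for a toy model. The support vectors, weights, bias, and the names `svm_output` and `classify` are hypothetical; they are not the trained values or the MATLAB code used in the study.

```python
import math

def linear_kernel(x, xi):
    """Linear kernel: the dot product x . x_i."""
    return sum(a * b for a, b in zip(x, xi))

def gaussian_kernel(x, xi, sigma):
    """Gaussian kernel: exp(-0.5 * ||x - x_i||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-0.5 * sq_dist / sigma ** 2)

def svm_output(x, support_vectors, alphas, labels, b, kernel):
    """Scalar output: sum_i alpha_i * y_i * K(x, x_i) + b."""
    return sum(a * y * kernel(x, sv)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

def classify(u, threshold=0.0):
    """+1 (glaucoma) above the threshold, otherwise -1 (normal)."""
    return 1 if u > threshold else -1

# Hypothetical toy model: two support vectors in 2-D, one per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, 0.5]
labels = [1, -1]
b = 0.0
sigma = math.sqrt(2)  # sigma taken proportional to sqrt(input dimension)

kernel = lambda x, xi: gaussian_kernel(x, xi, sigma)
u_pos = svm_output((0.9, 1.2), support_vectors, alphas, labels, b, kernel)
u_neg = svm_output((-1.1, -0.8), support_vectors, alphas, labels, b, kernel)
```

Sweeping the `threshold` argument of `classify` over the range of scalar outputs is one way to trace out sensitivity–specificity pairs for an ROC curve.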

**x**, we computed \(P[\mathbf{x} \mid \overline{G}]\) and \(P[\mathbf{x} \mid G]\). From these conditional probabilities, we could obtain the probability of glaucoma for a given SAP, **x**, by Bayes’ rule.
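The Bayes step can be written out in a few lines. This is a sketch only: the likelihood values and the prior probability of glaucoma are assumed numbers chosen for illustration, since the text here does not state which prior was used.

```python
def posterior_glaucoma(lik_glaucoma, lik_normal, prior_glaucoma):
    """P[G | x] from P[x | G], P[x | not G], and the prior P[G]."""
    joint_g = lik_glaucoma * prior_glaucoma
    joint_n = lik_normal * (1.0 - prior_glaucoma)
    return joint_g / (joint_g + joint_n)

# Hypothetical class-conditional densities for one field, equal priors.
p = posterior_glaucoma(lik_glaucoma=0.08, lik_normal=0.02, prior_glaucoma=0.5)
```

With these assumed inputs the posterior is 0.8; when the two likelihoods are equal, the posterior reduces to the prior.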

^{ 37 }MGG was trained and tested only with input reduced to eight dimensions by PCA.


^{ 42 }The statistical test we used for significant difference between ROC curve areas was dependent on the correlation of the curves (Table 2).^{ 39 } Without preselection of the comparisons, there were 45 comparisons of classifiers. For α = 0.05, the Bonferroni adjustment required *P* ≤ 0.0011 for the difference to be considered significant (Table 2).
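The cutoff follows from simple arithmetic, shown here as a short check. The count of 10 classifiers is an assumption read off the columns of Table 2; it is consistent with the 45 pairwise comparisons stated in the text.

```python
from math import comb

n_classifiers = 10                      # assumed: classifiers compared in Table 2
n_comparisons = comb(n_classifiers, 2)  # all unordered pairs of classifiers
alpha = 0.05
bonferroni_cutoff = alpha / n_comparisons  # about 0.0011
```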

^{ 43 }The training-test process was repeated until each partition had an opportunity to be the test set. Because the classifier was forced to generalize its knowledge on previously unseen data, we determined the actual error rate.
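The rotation of test partitions can be sketched as follows. The number of partitions is not given in this passage, so `k = 10` is an assumed value, and the helper names are hypothetical; any class stratification the authors may have used is not shown.

```python
import random

def k_fold_partitions(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, k, train_and_test):
    """Call train_and_test once per fold, with that fold held out as the test set."""
    folds = k_fold_partitions(n_samples, k)
    results = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        results.append(train_and_test(train_idx, test_idx))
    return results

# 189 normal + 156 glaucomatous eyes = 345 samples; k = 10 is assumed.
folds = k_fold_partitions(345, 10)
sizes = cross_validate(345, 10, lambda train, test: (len(train), len(test)))
```

In practice `train_and_test` would fit a classifier on the training indices and return its outputs on the held-out indices, so that every sample is scored exactly once by a model that never saw it.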

^{ 39 }Glaucoma experts consider the cost of a false positive to be greater than that of a false negative. A high specificity is desirable because the prevalence of glaucoma is low and progression is very slow. The left-hand end of the ROC curve is of interest when high specificity is desired. Consequently, we also compared the sensitivities at specificities of 0.9 and 1.0.
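Reading a sensitivity off the ROC curve at a fixed specificity amounts to placing the threshold so that the required fraction of normal eyes falls at or below it. The sketch below assumes higher scores indicate glaucoma; the function name and the score lists are hypothetical toy data, not study results.

```python
import math

def sensitivity_at_specificity(scores_normal, scores_glaucoma, specificity):
    """Sensitivity when the threshold is set to reach the target specificity."""
    ranked = sorted(scores_normal)
    # Number of normal eyes that must score at or below the threshold.
    n_below = math.ceil(specificity * len(ranked) - 1e-9)
    threshold = ranked[n_below - 1] if n_below > 0 else float("-inf")
    # Glaucomatous eyes scoring above the threshold are detected.
    detected = sum(1 for s in scores_glaucoma if s > threshold)
    return detected / len(scores_glaucoma)

# Hypothetical classifier outputs for 10 normal and 10 glaucomatous eyes.
scores_normal = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.80]
scores_glaucoma = [0.30, 0.50, 0.60, 0.70, 0.85, 0.90, 0.95, 0.42, 0.75, 0.65]

sens_at_1 = sensitivity_at_specificity(scores_normal, scores_glaucoma, 1.0)
sens_at_09 = sensitivity_at_specificity(scores_normal, scores_glaucoma, 0.9)
```

In this toy data a single high-scoring normal eye drives the sensitivity at specificity 1.0 well below the value at 0.9, which is why both operating points are informative.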

*P* = 0.0038, compared with the Bonferroni cutoff of 0.0011). There was poor correlation between LDF with PCA and both PSD and CPSD (ρ = 0.48 and 0.38, respectively).

*P* = 0.0009), because there was higher correlation between the curves for PSD and MoG constrained to QDF. Removing age from the data set lowered the area under the curve for MoG constrained to QDF by 0.008 (from 0.917 to 0.909). Although MoG with PCA had a slightly higher sensitivity (0.673) at specificity 1 than GHT (0.667), the two values were similar.

^{ 8 }

^{ 44 }

^{ 45 }Spenceley et al.^{ 45 } reported sensitivity of 0.65 and 0.90 at specificities 1.0 and 0.96, respectively, with MLP; the MLP was taught which fields were glaucomatous and which were normal from an interpretation of the fields by an observer. We obtained sensitivities of 0.67 and 0.73 at these specificities with MoG, our best classifier; the machine classifiers were taught which fields were glaucomatous and which were normal from an interpretation of the optic nerve for the presence of GON by the consensus of two observers. Researchers using pattern recognition methodology consider that an indicator other than the test being evaluated should be used as a gold standard for teaching the classifiers. Also, if the human interpretation of the visual field is used as the indicator for teaching the classifier, the classifier cannot exceed the human interpreter in accuracy. With GON as the indicator of disease, we found that the MoG machine classifiers generated ROC curves that were higher than the sensitivity–specificity pairs from glaucoma experts. Other studies used MLPs for automated diagnosis.^{ 5 }^{ 8 }^{ 44 }^{ 45 } We found that the ROC curves of the MoG machine classifiers were higher than the curve for MLP, particularly in the high-specificity region. This observation implies that the MoG classifiers perform better than the MLP used in previous reports.

^{ 46 }

^{ 47 }and frequency-doubling technology perimetry (FDT),^{ 48 }^{ 49 } with which clinicians and researchers have less experience. It is likely that machine classifiers will be able to learn from these data and exceed the ability of glaucoma experts in interpreting these tests.

^{ 19 }

^{ 20 }

^{ 21 }The MLP has been successfully applied to a wide class of problems, such as face recognition^{ 22 } and character recognition.^{ 23 } The architecture is a universal feed-forward network; the input layer and output layer of nodes are separated by one or more hidden layers of nodes. The hidden layers act as intermediary between the input and output layers, enabling the extraction of progressively useful information obtained during learning. The activation function of each neuron uses a sigmoid function to approximate a threshold or step. The use of a continuous sigmoid function instead of a step function enables the generation of an error function for correcting the weights. The sigmoid function may be logistic or hyperbolic tangent.
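A forward pass through such a network with one hidden layer can be sketched in plain Python. The layer sizes and weights below are hypothetical and untrained; the sketch only shows how a logistic or hyperbolic tangent activation is applied at each hidden node, with a logistic output node assumed so that the result lies between 0 and 1.

```python
import math

def logistic(u):
    """Logistic sigmoid, a smooth approximation to a step function."""
    return 1.0 / (1.0 + math.exp(-u))

def layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . inputs + b) for each node."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b, activation=logistic):
    """Input -> hidden layer -> single logistic output node."""
    hidden = layer(x, hidden_w, hidden_b, activation)
    return layer(hidden, out_w, out_b, logistic)[0]

# Hypothetical weights: 3 inputs, 2 hidden nodes, 1 output node.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
out_w = [[1.0, -1.0]]
out_b = [0.0]

x = [1.0, 2.0, 3.0]
y_logistic = mlp_forward(x, hidden_w, hidden_b, out_w, out_b)
y_tanh = mlp_forward(x, hidden_w, hidden_b, out_w, out_b, activation=math.tanh)
```

Because both activations are differentiable, the output error can be propagated back to adjust the weights, which a hard step function would not allow.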

^{ 24 }

^{ 25 }They exploit statistical learning theory to minimize the generalization error when training a classifier. SVMs have generalized well in face recognition,^{ 26 } text categorization,^{ 27 } recognition of handwritten digits,^{ 28 } and breast cancer diagnosis and prognosis.^{ 29 }

\(f(u) = \operatorname{sign}(u) = \operatorname{sign}(\mathbf{w}^{T}\mathbf{x} + b)\), where **x** is the input vector, **w** is the adjustable weight vector, \(\mathbf{w}^{T}\mathbf{x} + b = 0\) is the hyperplane decision surface, \(f(u) = -1\) designates one class (e.g., normal), and \(f(u) = 1\) the other class (e.g., glaucoma). For linearly separable data, the parameters **w** and *b* are chosen such that the margin (\(\propto 1/\lVert\mathbf{w}\rVert\)) between the decision plane and the training examples is at maximum. This results in a constrained quadratic programming (QP) problem in search for the optimal weight **w**.

\(\mathbf{w} = \sum_{i=1}^{p} \alpha_{i} y_{i} \mathbf{x}_{i}\), where *p* is the number of support vectors, \(\alpha_{i}\) is the contribution from the support vector \(\mathbf{x}_{i}\), and \(y_{i}\) is the training label. The output of the SVM is \(u(\mathbf{x}) = \sum_{i=1}^{p} \alpha_{i} y_{i} \mathbf{x}_{i}^{T}\mathbf{x} + b\). Instead of a hard (glaucoma or not glaucoma) decision function, we convert the SVM output \(u(\mathbf{x})\) into a probabilistic one, using a logistic transformation.^{ 36 }
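The two expressions for the linear SVM output, one through the weight vector and one through the support vectors, give the same number, and a logistic function then maps that number into (0, 1). The sketch below checks this on arbitrary values that do not come from a trained model; the `slope` and `offset` parameters of the logistic mapping are assumptions standing in for whatever the fitted transformation would supply.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def weight_vector(support_vectors, alphas, labels):
    """w = sum_i alpha_i * y_i * x_i."""
    dim = len(support_vectors[0])
    return [sum(a * y * sv[d] for a, y, sv in zip(alphas, labels, support_vectors))
            for d in range(dim)]

def output_from_support_vectors(x, support_vectors, alphas, labels, b):
    """u(x) = sum_i alpha_i * y_i * x_i^T x + b."""
    return sum(a * y * dot(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

def logistic_probability(u, slope=1.0, offset=0.0):
    """Map u(x) into (0, 1); slope and offset are assumed, not fitted here."""
    return 1.0 / (1.0 + math.exp(-(slope * u + offset)))

# Arbitrary illustrative values, not a trained SVM.
support_vectors = [(2.0, 1.0), (0.0, -1.0), (1.0, 3.0)]
alphas = [0.4, 0.7, 0.3]
labels = [1, -1, 1]
b = -0.5
x = (1.5, 0.5)

w = weight_vector(support_vectors, alphas, labels)
u_from_w = dot(w, x) + b
u_from_sv = output_from_support_vectors(x, support_vectors, alphas, labels, b)
p_glaucoma = logistic_probability(u_from_sv)
```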

**x**), where an optimal hyperplane can be found to minimize classification errors.

^{ 30 }In this new space, the classes of interest in the pattern classification task are more easily distinguished. Although the separating hyperplane is linear in this high dimensional space induced by the non-linear mapping, the decision surface found by mapping back to the original low-dimensional input space will not be linear any more. As a result, the SVMs can be applied to data that are not linearly separable.

\(u(\mathbf{x}) = \sum_{i=1}^{p} \alpha_{i} y_{i} K(\mathbf{x}_{i}, \mathbf{x}) + b\), where \(K(\mathbf{x}_{i}, \mathbf{x}) = \vec{\varphi}^{T}(\mathbf{x})\,\vec{\varphi}(\mathbf{x}_{i})\) is called the kernel function. A full mathematical account of the SVM model is described by Vapnik.^{ 24 }

^{ 31 }In committee machines, a computationally complex task is solved by dividing it into a number of computationally simple tasks.^{ 32 } For the supervised learning, the computational simplicity is achieved by distributing the learning task among a number of “experts” that divide the input space into a set of subspaces. The combination of experts makes up a committee machine. This machine fuses knowledge acquired by the experts to arrive at a decision superior to that attainable by any one expert acting alone. In the associative mixture of Gaussian model (MoG), the experts use self-organized learning (unsupervised learning) from the input data to achieve a good partitioning of the input space. Each expert does well at modeling its own subspace. The fusion of their outputs is combined with supervised learning to model the desired response.
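The way several Gaussian experts jointly model one class can be illustrated with a mixture density. This sketch is deliberately simplified to one input dimension, and the component weights, means, and standard deviations are hypothetical; fitting them from data (for example by expectation-maximization) is not shown and is not claimed to match the authors' procedure.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, means, stds):
    """Weighted sum of component densities; weights should sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Hypothetical two-component mixture for a single class.
weights, means, stds = [0.6, 0.4], [0.0, 3.0], [1.0, 0.5]
density_at_zero = mixture_density(0.0, weights, means, stds)

# Riemann-sum check that the mixture integrates to about 1 over [-10, 10).
step = 0.001
area = sum(mixture_density(-10.0 + i * step, weights, means, stds) * step
           for i in range(20000))
```

One such mixture per class yields the class-conditional densities that Bayes' rule then converts into a probability of glaucoma.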

^{ 37 }we are able to model the class conditional densities with higher flexibility, while preserving a comprehension of the statistical properties of the data in terms of means, variances, kurtosis, etc. This recent approach was developed at the Salk Institute computational neurobiology laboratory. The independent component analysis mixture model can model various distributions, including uniform, Gaussian, and Laplacian. It has been demonstrated in real-data experiments that this model generally improves classification performance over the standard Gaussian mixture model.^{ 33 } The mixture of generalized Gaussians (MGG) uses the same mixture model as MoG. However, each cluster is now described by a linear combination of non-Gaussian random variables.

Classifier | Sensitivity at Specificity = 1 | Sensitivity at Specificity = 0.9 | Specificity of Experts | Sensitivity of Experts | ROC Area ± SE
---|---|---|---|---|---
**Human Experts on Standard Automated Perimetry** | | | | |
Expert 1 | | | 0.96 | 0.75 |
Expert 2 | | | 0.59 | 0.88 |
**STATPAC Global Indices** | | | | |
MD | 0.45 | 0.65 | | | 0.837 ± 0.022
PSD | 0.61 | 0.76 | | | 0.884 ± 0.020
CPSD | 0.64 | 0.74 | | | 0.844 ± 0.025
GHT | 0.67 | | | |
**Statistical Classifier** | | | | |
LDF | 0.32 | 0.60 | | | 0.832 ± 0.023
LDF with PCA | 0.48 | 0.64 | | | 0.879 ± 0.018
**Machine Classifiers** | | | | |
MLP | 0.25 | 0.75 | | | 0.897 ± 0.017
MLP with PCA | 0.54 | 0.71 | | | 0.893 ± 0.018
SVM linear | 0.44 | 0.69 | | | 0.894 ± 0.017
SVM linear with PCA | 0.51 | 0.67 | | | 0.887 ± 0.018
SVM Gaussian | 0.53 | 0.71 | | | 0.903 ± 0.017
SVM Gaussian with PCA | 0.57 | 0.75 | | | 0.899 ± 0.017
MoG (QDF) | 0.61 | 0.79 | | | 0.917 ± 0.016
MoG (QDF) with PCA | 0.67 | 0.78 | | | 0.919 ± 0.016
MoG with PCA | 0.67 | 0.79 | | | 0.922 ± 0.015
MGG with PCA | 0.01 | 0.78 | | | 0.906 ± 0.022

Classifier (ROC area) | Statistic | MD | PSD | CPSD | LDF PCA^{*} | MLP | SVM linear | SVM Gauss | MoG(QDF)^{†} | MoG PCA | MGG PCA
---|---|---|---|---|---|---|---|---|---|---|---
ROC Area | | 0.837 | 0.884 | 0.844 | 0.879 | 0.878 | 0.894 | 0.903 | 0.916 | 0.922 | 0.906
MD (0.837) | *P* value | | 0.019 | 0.77 | 0.018 | 0.007 | 0.0001 | <0.00005 | <0.00005 | <0.00005 | 0.0005
MD (0.837) | Correlation | | 0.55 | 0.42 | 0.61 | 0.60 | 0.76 | 0.76 | 0.54 | 0.56 | 0.49
PSD (0.884) | *P* value | | | 0.022 | 0.83 | 0.44 | 0.55 | 0.19 | 0.0009 | 0.006 | 0.18
PSD (0.884) | Correlation | | | 0.73 | 0.48 | 0.55 | 0.56 | 0.69 | 0.88 | 0.72 | 0.60
CPSD (0.844) | *P* value | | | | 0.16 | 0.024 | 0.034 | 0.007 | 0.0003 | 0.0001 | 0.004
CPSD (0.844) | Correlation | | | | 0.38 | 0.42 | 0.42 | 0.52 | 0.58 | 0.62 | 0.54
LDF PCA (0.879) | *P* value | | | | | 0.13 | 0.56 | 0.020 | 0.021 | 0.0055 | 0.10
LDF PCA (0.879) | Correlation | | | | | 0.77 | 0.91 | 0.83 | 0.56 | 0.58 | 0.55
MLP (0.878) | *P* value | | | | | | 0.71 | 0.55 | 0.17 | 0.10 | 0.60
MLP (0.878) | Correlation | | | | | | 0.86 | 0.85 | 0.65 | 0.56 | 0.53
SVM linear (0.894) | *P* value | | | | | | | 0.18 | 0.12 | 0.048 | 0.44
SVM linear (0.894) | Correlation | | | | | | | 0.92 | 0.63 | 0.62 | 0.58
SVM Gauss (0.903) | *P* value | | | | | | | | 0.26 | 0.15 | 0.84
SVM Gauss (0.903) | Correlation | | | | | | | | 0.75 | 0.67 | 0.60
MoG(QDF) (0.916) | *P* value | | | | | | | | | 0.68 | 0.52
MoG(QDF) (0.916) | Correlation | | | | | | | | | 0.62 | 0.50
MoG PCA (0.922) | *P* value | | | | | | | | | | 0.044
MoG PCA (0.922) | Correlation | | | | | | | | | | 0.88

**Figure 1.**

**Figure 2.**

**Figure 3.**

Measure | MoG (n = 41) | Expt. 1 (n = 39) | PSD (n = 41) | All Three (n = 34)
---|---|---|---|---
Mean Deviation | −0.80 ± 1.49* | 0.74 ± 1.58 | 0.82 ± 1.67 | 0.52 ± 1.36
No. of points P < 5% total deviation | 3.95 ± 4.89 | 4.03 ± 5.87 | 4.49 ± 7.06 | 2.82 ± 3.91
No. of points P < 5% pattern deviation | 3.02 ± 3.55 | 2.36 ± 1.97 | 2.32 ± 1.85 | 2.09 ± 1.94
No. of contiguous points pattern deviation P < 5% | 1.80 ± 2.46 | 1.26 ± 1.27 | 1.20 ± 1.19 | 1.15 ± 1.23

Measure | MoG (n = 8) | Expt. 1 (n = 8) | PSD (n = 8) | All Three (n = 3)
---|---|---|---|---
Mean deviation | −1.13 ± 1.77 | −1.91 ± 0.82 | −1.63 ± 1.28 | −2.29 ± 0.68
No. of points P < 5% total deviation | 7.13 ± 6.01 | 8.32 ± 5.83 | 7.88 ± 6.13 | 12.00 ± 6.00
No. of points P < 5% pattern deviation | 6.13 ± 4.45 | 5.75 ± 4.43 | 6.63 ± 4.10 | 10.00 ± 4.58
No. of contiguous points pattern deviation P < 5% | 2.63 ± 2.20 | 3.00 ± 1.69 | 3.00 ± 1.85 | 4.67 ± 1.53
Normal field | 8 | 8 | 7 | 3

*Invest Ophthalmol Vis Sci*. 1990;31:512–520.

*Arch Ophthalmol*. 1984;102:704–706.

*Acta Ophthalmol*. 1991;69:210–216.

*Invest Ophthalmol Vis Sci*. 1990;31:S503. Abstract nr 2471.

*Invest Ophthalmol Vis Sci*. 1994;35:3362–3373.

*Eye*. 1994;8:321–323.

*Arch Ophthalmol*. 1997;115:725–728.

*Am J Ophthalmol*. 1996;121:511–521.

*Artif Intell Med*. 1997;10:99–113.

*Lancet*. 1996;347:1146–1150.

*Lancet*. 1996;347:12–15.

*Lancet*. 1997;350:469–472.

*Ophthalmology*. 1989;96:616–619.

*Automated Static Perimetry*. 1999; 2nd ed. Mosby New York, NY.

*Klin Monatsb Augenheil*. 1984;184:374–376.

*Automated visual field evaluation.* *Arch Ophthalmol*. 1992;110:812–819.

*Annales Academiae Scientiarium Fennicae, Series AI: Mathematica-Physica*. 1947;37:3–79.

*Nature*. 1986;323:533–536.

*Complex Syst*. 1988;2:321–355.

*Pattern Recognition*. 1992;25:65–77.

*Statistical Learning Theory*. 1998; Wiley New York.

*The Nature of Statistical Learning Theory*. 2000; 2nd ed. Springer New York.

*Proceeding of CIKM ’98 7th International Conference on Information and Knowledge Management*. 1998;148–155. New York.

*Neural Networks: Proceedings of the CTP-PBSRI Joint Workshop on Theoretical Physics*. 1995;261–276. World Scientific Singapore.

*Oper Res*. 1995;43:570–577.

*Data Mining Knowledge Discovery*. 1998;2:121–167.

*Neural Networks: A Comprehensive Foundation*. 1999; 2nd ed. Prentice-Hall Upper Saddle River, NJ.

*Neural Comput*. 1991;3:79–87.

*IEEE Trans Pattern Analysis Machine Intell*. 2000;22:1078–1089.

*IEEE Trans Neural Networks*. 2000;11:1124–1136.

*Advances in Kernel Methods: Support Vector Learning*. 1998;185–208. MIT Press Cambridge, MA.

*Advances in Large Margin Classifiers*. 2000;61–74. MIT Press Cambridge, MA.

*Clin Chem*. 1993;39:561–577.

*Biometrics*. 1998;44:837–845.

*Radiology*. 1983;148:839–843.

*Crit Rev Diag Imaging*. 1989;29:307–335.

*Medical Decision Making*. 1984;4:137–150.

*J Roy Statist Soc Ser*. 1974;36:111–147.

*Glaucoma*. 1999;8:77–80.

*Ophthal Physiol Opt*. 1994;14:239–248.

*Arch Ophthalmol*. 1993;111:645–650.

*Invest Ophthalmol Vis Sci*. 1990;31:1869–1875.

*Invest Ophthalmol Vis Sci*. 1997;28:413–425.

*Clin Vis Sci*. 1992;7:371–383.