Clinical and Epidemiologic Research  |   May 2011
What Reduction in Standard Automated Perimetry Variability Would Improve the Detection of Visual Field Progression?
Author Affiliations & Notes
  • Andrew Turpin
    From the Departments of Computer Science and Software Engineering, and
  • Allison M. McKendrick
    Optometry and Vision Sciences, The University of Melbourne, Melbourne, Victoria, Australia.
  • Corresponding author: Andrew Turpin, Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Victoria, Australia; [email protected]
Investigative Ophthalmology & Visual Science May 2011, Vol.52, 3237-3245. doi:https://doi.org/10.1167/iovs.10-6255
Abstract

Purpose: The test–retest variability of standard automated perimetry (SAP) severely limits its ability to detect sensitivity decline. Numerous improvements in procedures have been proposed, but assessment of their benefits requires quantification of how much variability reduction results in meaningful benefit. This article determines how much reduction in SAP procedure variability is necessary to permit earlier detection of visual field deterioration.

Methods: Computer simulation and statistical analysis were used. Gaussian distributions were fit to the probability of observing any sensitivity measurement obtained with SAP and the Full Threshold algorithm to model current variability. The standard deviation of these Gaussians was systematically reduced to model a reduction of SAP variability. Progression detection ability was assessed using pointwise linear regression on decreases of −1 and −2 dB/year from 20 and 30 dB, with a custom criterion that fixed detection specificity at 95%. Test visits occurring two and three times per annum were modeled, and analysis was performed on single locations and whole fields.

Results: A 30% to 60% reduction in SAP variability was required to detect pointwise deterioration 1 year earlier than current methods, depending on progression rate and visit frequency. A reduction of 20% in variability generally allowed progression to be detected one visit earlier.

Conclusions: On average, the variability of SAP procedures must be reduced by approximately 20% for a clinically appreciable improvement in the detection of visual field change. Analysis similar to that demonstrated here can measure the improvement required of new procedures, assisting in cost–benefit assessment for the adoption of new techniques before lengthy and expensive clinical trials are undertaken.

Standard automated perimetry (SAP) is one of the most regularly used clinical visual assessment techniques. In its most commonly implemented form, visual sensitivity is measured for Goldmann size III targets across approximately the central 30° of visual field. One of the major shortcomings of this task is high test–retest variability, particularly for visual field locations of reduced sensitivity. 1,2 High variability makes the detection of visual field change difficult, with progressive conditions such as glaucoma requiring long sequences of test data from many patient visits to detect clinical change. 3–6 
SAP could be improved by modifying the procedure—for example, by improving either the test algorithms or the human factors associated with correct and attentive completion of the test. A variety of test algorithms have been proposed, designed either to improve the accuracy of the outcomes in areas of visual field loss 7,8 or to facilitate rapid completion while maintaining traditional test–retest variability. 9–12 As the instructions given to patients can significantly affect perimetric outcomes, 13 another approach might be to provide more consistent and informative instructions to patients, combined with more effective training of the staff who administer visual field tests. 
A commonly used model of the response variability of a given individual within a single test session is the psychometric function (or frequency-of-seeing [FoS] curve). For perimetric stimuli, this function describes the probability with which a stimulus of a particular luminance or contrast will be seen by the individual observer, assuming that the subject is performing the test as instructed. Psychometric functions measured with current SAP size III stimuli show a marked flattening (increase in variability) as sensitivity decreases. 14 An alternate proposed method for reducing perimetric variability is to alter the test stimuli in a manner that results in steeper psychometric functions. For example, increasing from a size III to a size V target results in steeper FoS curves. 15 Steeper FoS curves have also been proposed as an advantage of using the low spatial, high temporal frequency stimuli of the frequency-doubling perimeter; however, the comparison is not straightforward because the test dynamic range is restricted relative to SAP. 16,17 Keeping test–retest variability consistent across the test stimulus range has also been the goal of new stimuli proposed by Hot et al. 18 FoS curves can also be manipulated by altering the attentional demands of the task. 19  
Choosing to move to a new perimetric test procedure (measurement algorithm, stimulus, or both) is nontrivial, because new baseline measures, analysis techniques, and retraining are necessary. Although all these approaches for reducing perimetric variability are likely to have merit, we currently do not have any quantification of how much improvement in a test procedure would result in a meaningful increase in the ability to detect visual field change. The purpose of this study was to use computer simulation and statistical techniques to address this shortfall. We took what is known about current SAP variability and determined how much reduction in variability would be needed to result in improvements in detecting visual field change. 
Methods
Overview
First, we modeled the variability of current SAP tests by fitting Gaussian distributions to the behavior of the Full Threshold (FT) algorithm (Humphrey Field Analyzer [HFA]; Carl Zeiss Meditec, Dublin, CA), assuming a reliable patient model. In practice, SITA has largely replaced FT as the HFA clinical algorithm of choice; however, the precise implementation of SITA is not described in the public domain. Using FT as a model is relevant because SITA was designed to have the same error characteristics as FT while requiring fewer presentations. 10 Clinical studies demonstrate qualitatively similar threshold estimates and test–retest characteristics for FT and SITA, 1,2,9,20,21 with SITA thresholds being, on average, approximately 1 dB higher. Using this model, we examined the ability to detect linear decreases in visual field estimates using pointwise linear regression (PLR) with a fixed specificity of 95% as the standard deviations of the test procedure error distributions were systematically reduced. We also examined the ability to detect linear decreases in Mean Defect (MD) scores for whole fields with varying numbers of linearly deteriorating visual field locations. Finally, we examined how tests might be modified to achieve error reductions capable of detecting visual field deterioration 1 year earlier than current procedures. We now describe each method in detail. 
Model of Perimetric Variability
We used the FT algorithm 22 as the base procedure for our work. It is a staircase procedure that modifies stimulus luminance in steps of 4 dB until the first response reversal occurs and subsequently in steps of 2 dB. The HFA implementation of FT terminates after two response reversals and takes the stimulus luminance of the last-seen presentation as the final sensitivity estimate for a given location. In our implementation of FT, if the first estimate was more than 4 dB away from the starting point of the staircase, then the procedure was repeated using the first sensitivity estimate as the starting point, and the result of this second staircase was taken as the final estimate of sensitivity. 
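As an illustration, the staircase and restart rule just described can be sketched in Python (a simplified model for illustration only: the function names are ours, the observer's responses are drawn from a cumulative-Gaussian frequency-of-seeing curve of the form described below, and the handling of the 0 dB floor is simplified to clamping).

```python
import random
from math import erf, sqrt

def prob_seen(x, t, s, fp=0.01, fn=0.01):
    """Frequency-of-seeing: probability that a reliable observer with
    threshold t (dB) and spread s (dB) reports seeing a stimulus of
    x dB (higher dB = dimmer stimulus)."""
    G = 0.5 * (1.0 + erf((x - t) / (s * sqrt(2.0))))  # cumulative Gaussian at x
    return fp + (1.0 - fp - fn) * (1.0 - G)

def ft_staircase(t, s, start, rng, max_presentations=100):
    """One 4-2 dB staircase: step 4 dB until the first reversal, then
    2 dB; stop after the second reversal; return the last-seen level."""
    x, step = start, 4
    last_seen, prev, reversals = 0, None, 0
    for _ in range(max_presentations):
        seen = rng.random() < prob_seen(x, t, s)
        if seen:
            last_seen = x
        if prev is not None and seen != prev:
            reversals += 1
            step = 2
            if reversals == 2:
                break
        prev = seen
        # dimmer (higher dB) if seen, brighter (lower dB) if not
        x = min(max(x + (step if seen else -step), 0), 40)
    return last_seen

def full_threshold(t, s, start=25, rng=random):
    """FT as modeled here: repeat the staircase from the first estimate
    if that estimate lands more than 4 dB from the starting point."""
    est = ft_staircase(t, s, start, rng)
    if abs(est - start) > 4:
        est = ft_staircase(t, s, est, rng)
    return est
```

Sampling many runs of `full_threshold` for a fixed true threshold gives the kind of outcome distribution shown in Figure 1.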
We computed the error distribution of FT as follows. Given the FoS curve (psychometric function) of a patient for a particular location in the visual field and the starting point of the FT procedure, we could compute the probability of obtaining any particular measurement for that location, assuming that the patient was performing the task correctly. We used a technique that we have used previously, in which paths in the binary decision tree for the FT procedure with a given start guess are labeled with probabilities from an assumed patient's FoS curve. 23 The leaves of the tree represent possible outcomes from the procedure with an associated probability. Two example probability distributions derived in this study are shown in Figure 1.
Figure 1.
 
Probability of obtaining a measurement using FT (averaged over all starting points) when the patient has a true sensitivity of 4 dB (left) and 10 dB (right) assuming 1% false-positive and -negative rates, and a slope of the FoS curve as 6 dB SD of a cumulative Gaussian function. Bars: actual probabilities; solid line: best-fit Gaussian.
Both panels use a FoS curve, Ψ, described by Abbott's formula 24 : Ψ(x) = fp + (1 − fp − fn)[1 − G(x, t, s)], where fp is the false-positive rate defining the lower asymptote of Ψ; fn is the false-negative rate defining the upper asymptote of Ψ; s is the standard deviation of a cumulative Gaussian defining the spread of Ψ; t is the threshold, or translation of Ψ, along the abscissa; and G(x, t, s) is the value at x of a cumulative Gaussian distribution with mean t and SD s. For these experiments, we assumed that the patient would make few errors, and so we set the false-positive and false-negative rates to 1%. With reducing visual field sensitivity, the psychometric function slope flattens for size III SAP targets. 14,15 Our model accounts for this dependence of variability on sensitivity by varying the standard deviation of the Gaussian with sensitivity, using s = exp(−0.066 × t + 2.81) as previously reported for clinical data, 14 capped at a maximum of 6 dB. In the two examples shown in Figure 1, we assumed that all starting points for the FT procedure were equally likely, and summed and normalized the 41 possible distributions (start points of 0, 1, …, 40 dB) into the ones shown. 
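The FoS model just described, Abbott's formula with a sensitivity-dependent spread, can be written down directly (a sketch; the function names are ours):

```python
from math import erf, exp, sqrt

def fos_spread(t):
    """Spread s (dB) of the cumulative Gaussian as a function of true
    sensitivity t (dB), from the clinical fit s = exp(-0.066*t + 2.81),
    capped at a maximum of 6 dB."""
    return min(exp(-0.066 * t + 2.81), 6.0)

def fos(x, t, fp=0.01, fn=0.01):
    """Abbott's formula: probability of seeing a stimulus of x dB given
    a true threshold of t dB; fp and fn set the asymptotes."""
    s = fos_spread(t)
    G = 0.5 * (1.0 + erf((x - t) / (s * sqrt(2.0))))  # cumulative Gaussian at x
    return fp + (1.0 - fp - fn) * (1.0 - G)
```

At threshold (x = t) the probability of seeing is close to 50%, and the curve flattens (larger s) as true sensitivity falls.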
We computed these distributions for values of t from 0 to 40 dB; then, for each distribution, we fit a Gaussian G(0…40, mt, st) using the nonlinear minimization (nlm) function in R to minimize the L1 norm (see Fig. 1 for examples where t = 4 and t = 10 dB). Because we assumed high variability in patient response (a flat FoS curve) when true sensitivity was low, there was a "floor effect" whereby FT returned 0 dB very often. This effect can be seen in Figure 1, left, where, although the true sensitivity is 4 dB, 0 dB occurs nearly as often as 3 dB. This situation arises when a patient fails to see 0 dB twice: FT then terminates and returns a sensitivity of 0 dB, resulting in multimodal distributions for low true sensitivities, as in Figure 1, left, and making the Gaussian fit poor. This phenomenon has been widely reported in clinical studies of the test–retest distribution of SAP thresholds, 1,2,16 where the distribution of retest values is skewed to lower sensitivities, particularly for low test sensitivities. We note in passing that if we sample test and retest values from our fitted Gaussian distributions (raising any observed negative values to 0), we get very similar skewed distributions, and so the use of symmetrical Gaussians to model measured-given-true sensitivities is not inconsistent with the asymmetrical test–retest distributions reported in the literature. Examples of such distributions are shown in the Results section. 
Another idiosyncrasy of the FT algorithm is that it underestimates sensitivities by approximately 1 dB on average, because the estimate returned is the last-seen stimulus. 1,25 We "recentered" the distributions by setting the mean to the true sensitivity value of the patient. Thus, the distribution of possible outcomes from FT for true sensitivity t is described by G(0…40, t, st). 
By choosing a true sensitivity value and then sampling from G(0…40, t, st), we get a measured sensitivity value that is typical of current SAP procedures. As the purpose of this work was to investigate the benefits of improving SAP variability, we simulated improved procedures by systematically reducing st in steps of 10%, sampling from G(0…40, t, 90% × st), G(0…40, t, 80% × st), and so on. We also examined the ability to detect progression if the FoS slopes were consistent across the range of available sensitivity estimates—that is, if there was no flattening of the psychometric function slope with decreasing sensitivity. Avoiding such flattening is the goal of several research groups exploring alternatives to current size III SAP stimuli. 15,18  
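The sampling step can be sketched as follows (our own illustrative code, using the two fitted standard deviations quoted later in the Methods, s30 = 1.65 dB and s20 = 3.17 dB; the 10% reduction steps are passed as a fraction):

```python
import random

# Fitted SDs of the FT error distributions used in this study (dB),
# for true sensitivities of 30 and 20 dB.
S_T = {30: 1.65, 20: 3.17}

def measured_sensitivity(t, reduction=0.0, rng=random):
    """Sample a measured value from the recentered error model
    G(0..40, t, s_t), with the SD shrunk by `reduction`
    (e.g. 0.2 models a 20% reduction in variability)."""
    s = (1.0 - reduction) * S_T[t]
    x = rng.gauss(t, s)
    return min(max(round(x), 0), 40)  # clamp to the 0-40 dB range
```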
Simulated Progression for Single Locations
To study the effect of reducing test error on detecting change, we simulated progression by decreasing a single location from a starting sensitivity of 30 dB by either 1 or 2 dB per year for 10 years. Visual fields were measured two or three times per year during this period. A starting point of 30 dB avoids the "floor effect" of current SAP procedures (whereby, for true thresholds lower than 10 dB, 0 dB is a commonly returned sensitivity value) and represents a range of vision where the detection of glaucomatous progression is clinically important to management decisions designed to prevent further visual field deterioration. We also simulated progression for a single location decreasing from 20 dB by either 1 or 2 dB per year for 10 years. This model represents progression from an abnormal baseline, as in moderate to advanced glaucoma, and explores the influence of the floor effect present in current SAP procedures. The simulation was repeated 10,000 times for each condition (progression rate of 1 or 2 dB/year; visual fields measured twice or thrice per year). 
Measuring Time until Progression Confirmed
To classify a location as progressing, we used PLR. There are various criteria in the literature for defining progression using PLR, the most common of which is that the regression line through sensitivities measured at a location must have a statistically significant slope of less than −1 dB per annum (P < 0.01). 26 All of these criteria are based on clinical data collected with current SAP procedures. It is likely that these existing criteria would be overly conservative for determining change with procedures modeled to have substantially less error than current techniques. 
Fortunately, in this study we knew the precise distribution of measurement error for each measured value, G(0…40, t, st), and the true baseline sensitivity for any location (30 or 20 dB). Thus, it was possible to derive precise 95% confidence intervals for the slope of a regression line on a sequence of measurements that did not change. Repeatedly generating sequences of measurements of various lengths (numbers of visits) drawn from the Gaussian distribution G(0…40, 30, s30 = 1.65) for a baseline of 30 dB, or G(0…40, 20, s20 = 3.17) for a baseline of 20 dB, gives the distribution of the slope of a regression line through these sequences. Then, for each number of visits (sequence length), we can take the lower 5% quantile slope value as a cutoff point for determining change with a specificity of 95%. The number of visits per year, assuming we wish to express the cutoff slope in decibels per year, must also be included in the repeated regressions as a scaling of the independent variable (time). Figure 2 shows the values used in this study. 
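The cutoff derivation can be sketched in Python (our own simplified code; `slope_cutoff` follows the recipe above of regressing simulated stable sequences and taking the lower 5% quantile of slopes, shown here for the 30 dB baseline):

```python
import random

def ols_slope(ys, visits_per_year):
    """Ordinary least-squares slope of measurements vs time in years."""
    xs = [i / visits_per_year for i in range(len(ys))]
    n = len(ys)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def slope_cutoff(n_visits, visits_per_year, s=1.65, base=30.0,
                 n_sim=10000, rng=random.Random(1)):
    """Lower 5% quantile of slopes (dB/year) through simulated stable
    sequences: calling progression below this cutoff fixes specificity
    at 95%."""
    slopes = []
    for _ in range(n_sim):
        ys = [min(max(rng.gauss(base, s), 0.0), 40.0)
              for _ in range(n_visits)]
        slopes.append(ols_slope(ys, visits_per_year))
    slopes.sort()
    return slopes[int(0.05 * n_sim)]
```

Cutoffs for short sequences are far below zero and move toward zero as the number of visits grows, as in Figure 2.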
Figure 2.
 
Slope cutoff criteria for calling progression using PLR derived by sampling 10,000 sequences of a subject stable at 30 or 20 dB and taking the lower 5% quantile of the collected slopes (hence keeping specificity fixed at 95%).
Simulated Progression for Global Field Summary Index: Mean Defect
We also simulated progression on whole fields, to study the effect that reducing error would have on the time to call progression if linear regression on MD were used. Each field began with 52 locations with true thresholds of 30 dB. Between 1 and 10 locations were decreased by 1 or 2 dB per year for 6 years. To set a criterion for labeling a field as progressing, we used the same technique as for the PLR cutoffs that led to the generation of Figure 2. Namely, we generated 1000 measured fields for 52 stable locations at 30 dB and took the lower 5% quantile of MD regression slopes at each time point (1 to 12 or 18 visits, depending on whether two or three visits were conducted per year), to determine a slope cutoff giving 95% specificity. We did not include an ageing adjustment in the simulations; this would simply be a small, constant addition to the regression slope and would alter the cutoffs required rather than changing the form of the results. Further, since the period of investigation was short (several years), we suspect that ageing would have minimal impact. As we were interested only in the slope of the linear regression on MD, and not the actual MD values themselves, we calculated the slope on the mean of the raw field values rather than on the mean of the total deviation values; this simplifies the calculation and removes the need for normative data, while giving exactly the same slope as if a constant, age-matched normative field had been subtracted from the raw values. As we did for PLR, we investigated regression slopes on MD for two and three visits per year. We report the time required to obtain a sensitivity of 80% and 90% (with specificity fixed at 95%). 
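A minimal sketch of the whole-field simulation (our own code; for simplicity the measurement SD is held at the 30 dB value of 1.65 dB rather than varying with sensitivity, so it understates the noise at progressed locations):

```python
import random

def md_slope(n_prog, rate, visits_per_year=2, years=6, s=1.65,
             rng=random.Random(0)):
    """OLS slope (dB/year) of the mean of 52 simulated raw thresholds,
    n_prog of which decline at `rate` dB/year from 30 dB; the rest stay
    stable at 30 dB."""
    n_visits = years * visits_per_year
    xs = [v / visits_per_year for v in range(n_visits)]  # time in years
    means = []
    for time in xs:
        vals = []
        for loc in range(52):
            t = max(30.0 - rate * time, 0.0) if loc < n_prog else 30.0
            vals.append(min(max(rng.gauss(t, s), 0.0), 40.0))
        means.append(sum(vals) / 52.0)
    mx, my = sum(xs) / n_visits, sum(means) / n_visits
    return (sum((x - mx) * (y - my) for x, y in zip(xs, means))
            / sum((x - mx) ** 2 for x in xs))
```

With 10 of 52 locations declining at 2 dB/year, the expected slope of the mean field value is roughly 10 × 2 / 52 ≈ 0.38 dB/year of loss.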
Results
Single Locations
Figure 3 shows the proportion of the pool of 10,000 progressing locations that were flagged as progressing according to the criteria shown in Figure 2. The solid square curve (error representative of current procedures) at 4 years in the top two panels indicates a sensitivity of 80% and 99% for declines of 1 and 2 dB/year, respectively. This sensitivity and time frame needed to detect progression were consistent with those in other studies modeling PLR. 26,27 When the number of visits per year was increased to 3, the sensitivity for detecting a change of −1 dB/year increased to 89% (bottom left panel) and to 100% for −2 dB/year. 
Figure 3.
 
Proportion of progressing locations identified, where each point begins at 30 dB and deteriorates as per the label in the bottom right corner of each panel for 6 years. The number of fields measured per year also appears in the bottom right of each panel. (■) Current SAP procedures; points designated x represent a reduction factor (1 − x × 10%) in the standard deviation for the procedure (i.e., 8 is 80% of the SD of current procedures or a 20% reduction in error). Lines are added for clarity. Shaded region: 1 year relative to current procedures (■), hence symbols appearing to the left of the region indicate that progression is determined more than 1 year earlier than current procedures.
Reading horizontally across the curves in Figure 3 gives an indication of the reduction in procedure error necessary to achieve a certain time to detection while holding sensitivity constant. For example, to achieve 80% sensitivity to a change of −1 dB/year in 3 years with two visits per year, we find the highest-numbered curve that intersects the point (3, 80%), which is (roughly) curve 6, indicating that the procedure would have to have a standard deviation 60% of current behavior (a 40% reduction in variability). Overall, Figure 3 shows that a reduction in error of between 40% and 60% relative to current procedures would reduce detection times by 1 year, depending on the rate of progression. 
Figure 4 presents results for the situation in which the 10,000 locations begin at 20 dB. The format of the figure is the same as in Figure 3. Because of the increased variability of responses at lower sensitivity levels, the time needed to detect the same proportion of progressing locations increased relative to the 30-dB situation (the curves are shifted to the right relative to Fig. 3). Approximately a 30% to 40% reduction in error resulted in progression being detected 1 year earlier on average. 
Figure 4.
 
As in Figure 3, but with each location beginning at 20 dB.
Global Fields: Mean Defect
Figure 5 shows results for the whole-field condition, where a varying number of test locations (indicated on the x-axis of each panel) linearly decrease in sensitivity from 30 dB. Figures 5A and 5B show results for progression of 1 dB/year with two and three visits per year, respectively. The left-hand panels provide results for 80% sensitivity to call progression, whereas the right-hand panels show 90% sensitivity. Figures 5C and 5D are similarly formatted, but for the more rapidly progressing case (2 dB/year). In each panel, the light bars show the time (number of years) taken to reach the criterion sensitivity for detecting progression (80% or 90%). As expected, this time decreased with an increasing number of progressing locations (1–10) and was reached more rapidly for 80% sensitivity than for 90% sensitivity. The dark bars show the time improvement gained if the procedure error is reduced by 40%. We explored the improvement for a range of different error reductions (as per Figs. 3, 4) but show here the 40% condition, as it improved the ability to detect progression by approximately 1 year on average. 
Figure 5.
 
The lighter bars show the number of years needed to reach a sensitivity of 80% (left) and 90% (right) using criteria on the slope of linear regression on MD, such that specificity is fixed at 95%. The number under each bar is the number of locations (of 52) that had a true threshold of 30 dB that decreased by either 1 dB per year (A, B) or 2 dB per year (C, D); the remainder of the true field was stable at 30 dB. (A) and (C) give results for two visits per year, whereas (B) and (D) show the results for three visits per year. The darker bars show the reduction in the number of years to obtain 80% or 90% sensitivity that would accrue if a test with 40% less error (60% of current) were used. Bars truncated at 6 years have upward arrows at the top.
Comparison of Possible Approaches to Reduce Variability
In the preceding section, we identified that a reduction of between 40% and 60% in variability would be required to improve the ability to detect progression by approximately 1 year on average. Is it possible for this magnitude of reduction in variability to be realized? Possible approaches include (1) altering the test algorithm or (2) changing the stimulus type in a manner that reduces the marked flattening of the psychometric function with decreasing sensitivity (for example, by increasing stimulus size and/or altering spatial/temporal frequency content 15,17 ). To investigate the potential benefits of these approaches, we used our model of perimetric variability to calculate the error distribution of several alternate theoretical procedures. The method was the same as described above for determining the error distribution of the FT procedure, substituting either a different test algorithm or a different FoS curve. Figure 6 shows the resultant standard deviation of the Gaussian fitted to the error distributions for the following procedures:
  1. A ZEST procedure (a Bayesian adaptive procedure used in the Humphrey Matrix Perimeter [Carl Zeiss Meditec] and described fully elsewhere 28,29 ), which has been shown to return less variable thresholds on average when run for a comparable number of presentations to FT. We tried several ZEST procedures and show here an implementation that reduced perimetric variability by approximately 40%. Note that to achieve this reduction, ZEST required an average of 20 presentations to terminate (the termination criterion being that the standard deviation of the posterior distribution fell below 0.5 dB) and was seeded with a uniform prior distribution. Figure 6 also includes a version of the same ZEST requiring nine presentations on average to terminate (termination criterion: standard deviation of the posterior distribution below 1.7 dB). These procedures are indicative of the performance gains to be expected from improving SAP test algorithms simply by asking more questions of the observer to improve the reliability of estimates.
  2. The FT procedure with a constant FoS spread of 2 dB across the entire stimulus range tested. This model represents a theoretical new procedure for which the psychometric function slope does not flatten with decreasing sensitivity.
Figure 6.
 
The standard deviations of the Gaussians used to model procedure error in this study (■ and numbered curves), in addition to three new procedures. Constant FoS (○) assumes a constant FoS spread of 2 dB for all stimulus levels; ZEST 9 (▾) and ZEST 20 (▴) are ZEST procedures with a uniform prior, terminating when the standard deviation of the pdf (probability density function) falls below 1.7 dB (average, 9 presentations) and 0.5 dB (average, 20 presentations), respectively.
Figure 6 shows that relatively minor gains were achieved by simply running test algorithms for longer. Counteracting these improvements are likely increases in variability due to fatigue or loss of attention during longer tests, which are not included in the model. Significantly greater reductions in variability are likely to be achieved by altering the test stimulus rather than the test algorithm. 
Comparison of Our Model to Empiric Clinical Test–Retest Data for SAP
The results above suggest that a 20% to 40% reduction in error would have a clinically meaningful impact on the average ability to detect visual field deterioration. In this section we illustrate what clinical test–retest data are predicted to look like in the presence of such a reduction in error. Computer simulation allowed us to precisely control the error model used, whereas clinical studies are restricted to reporting test–retest variability as a surrogate for error. Figure 7 shows test–retest graphs generated by our error models. The three plots show the median and 95% confidence limits expected at retest, given the baseline test value on the x-axis, with the dashed lines showing test–retest data from Figure 4b of Artes et al. 17 The simulated results were generated assuming a Gaussian model with standard deviations given by 80% × exp(−0.06 × t + 2.4) for t = 0…35, which is 80% of the best fit of FoS spread for normal observers. 14 This scaling was chosen so that the 95% confidence intervals roughly equalled those reported previously, 17 illustrating that test–retest behavior similar to that encountered clinically can be generated by our Gaussian model of error (Fig. 7, leftmost panel). The "patient" testing methodology used to generate the results is the same as that of Artes et al. 17 : each model patient generated six visual fields, and all combinations of pairs were taken as baseline and follow-up in turn. The true thresholds of the patient population comprised 200 thresholds of each value from 0 to 35 dB. The middle and right-hand panels illustrate the predicted test–retest distribution with a 20% and 40% reduction in error, respectively. 
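The pairing methodology can be sketched as follows (our own illustrative code; the SD model is the 80%-scaled normal-observer fit quoted above, and `reduction` applies the further shrinkage shown in the middle and right panels):

```python
import random
from math import exp

def retest_pairs(true_t, reduction=0.0, n_fields=6, rng=random):
    """Simulate six measurements of one location and return all ordered
    (baseline, retest) pairs, as in the pairing methodology of Artes
    et al. The SD is 80% of the normal-FoS fit, further shrunk by
    `reduction` (0.2 or 0.4 for the middle and right panels)."""
    s = (1.0 - reduction) * 0.8 * exp(-0.06 * true_t + 2.4)
    fields = [min(max(round(rng.gauss(true_t, s)), 0), 40)
              for _ in range(n_fields)]
    return [(a, b) for i, a in enumerate(fields)
                   for j, b in enumerate(fields) if i != j]
```

Pooling these pairs over many simulated patients, binning by baseline value, and taking the 2.5th and 97.5th quantiles of the retest values reproduces plots of the kind shown in Figure 7.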
Figure 7.
 
Test–retest results generated from our computer simulation compared with previously published test–retest data for SAP. Dashed lines: the 95% limits of empiric test–retest data published in Figure 4b of Artes et al. 17 The line plots show the median and the 2.5th and 97.5th quantiles of test–retest results generated by computer simulation. Middle and right panels: the likely test–retest error distributions that would result from a 20% and 40% reduction in the error of our simulated patient model.
Discussion
By simulating visual field measurements for an assumed progressing patient using procedures with different error characteristics, and then measuring the time to detect progression using PLR, we have shown that procedure variability must decrease by approximately 40% relative to current procedures to detect progression 1 year earlier, depending on the progression rate and the method of detecting progression (global fields versus single locations). An improvement of approximately one visit on average requires a reduction in variability of 20%. Consequently, proposed improvements to SAP procedures that achieve only minor reductions in variability are unlikely to afford any meaningful benefit to the ability to detect visual field progression on average. 
We speculate on ways that test procedures may be altered to achieve such reductions in variability. Figure 6 investigates the number of presentations that might be required per location within a visual field test and suggests that such reductions are unlikely to be achievable in a clinically feasible number of presentations with current test stimuli, test patterns, and FoS curves. This analysis assumes that the error reductions are necessary across the entire measurement range and assumes the dynamic range of current SAP procedures. There are alternatives, however, such as reducing procedure variability by increasing presentations only at select locations or sensitivity values where progression is deemed most important to detect or most likely to be present. Alternately, one could combine suprathreshold screening and full-threshold determination, as in the EMU (estimation minimizing uncertainty) procedure we have previously explored, which uses more presentations in areas of reduced sensitivity to reduce variability. 7,8  
An alternate approach to reducing test procedure variability is simply to test more frequently with existing methods. Figure 3 shows that testing three times per year with current methods was roughly equivalent to testing twice a year with a procedure that has 70% of the current error, for locations progressing at 1 dB per year. For more rapid progression, there was still benefit in testing three times per year, rather than two; however, the relative advantage decreased (Fig. 3). 
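This tradeoff between visit frequency and procedure error can be illustrated with the same specificity-fixed PLR machinery. The sketch below uses a simple constant Gaussian error (the 2 dB SD is an assumed illustrative value, not the paper's threshold-dependent model) and compares the hit rate for a location declining at 1 dB/year under the two regimes.

```python
import numpy as np

rng = np.random.default_rng(2)

def hit_rate(rate, sd, visits_per_year, years=6, n=2000):
    # Proportion of progressing sequences flagged by pointwise linear
    # regression, with the slope cutoff set on stable sequences so that
    # specificity is 95%; sd is a Gaussian error SD in dB
    times = np.arange(int(visits_per_year * years)) / visits_per_year

    def slopes(true):
        # Fit all n noisy sequences at once; polyfit accepts a 2-D y
        # whose columns are independent datasets
        noisy = true + rng.normal(0.0, sd, (n, times.size))
        return np.polyfit(times, noisy.T, 1)[0]

    cutoff = np.quantile(slopes(np.full(times.size, 30.0)), 0.05)
    return float(np.mean(slopes(30.0 - rate * times) < cutoff))

# Location declining at 1 dB/year from 30 dB:
# 3 visits/year at full error vs 2 visits/year at 70% of the error
a = hit_rate(rate=1.0, sd=2.0, visits_per_year=3)
b = hit_rate(rate=1.0, sd=0.7 * 2.0, visits_per_year=2)
```

Under these assumptions the two regimes give comparable hit rates, mirroring the rough equivalence reported in Figure 3; the exact numbers depend on the error model and follow-up duration chosen.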
Our simulations suggest that significant gains in the ability to detect progression are likely to be achieved by the development of perimetric procedures for which FoS responses are steeper across the dynamic range of testing. This possibility is illustrated by our theoretical procedure in Figure 6, which fixes the FoS slope across the range of testing. The issue of dynamic range is important in the interpretation of our simulations in the context of existing clinical procedures. Our simulations assumed the dynamic range and error characteristics of SAP; however, alternate stimulus types may have different measurement scales. For example, the test–retest variability of frequency-doubling technology (FDT) perimetry does not increase with decreasing sensitivity in the same manner as SAP. 16,17 On a decibel scale, FDT has a significantly reduced dynamic range relative to SAP, creating a numerical floor in terms of contrast scaling. A simple comparison of decibel scaling is not a direct correlate of the ability of these stimulus types to detect progression or to quantify more advanced visual field loss, however, due to marked differences in the spatial extent of the stimuli and their different temporal properties. 
In future, some combination of improved algorithms with different stimuli and test procedures is likely to yield the most benefit—for example, a task that uses stimuli and procedures that result in steeper FoS curves in areas of midrange loss (possibly larger targets 15 ), combined with altering the procedure to maintain attentional demands throughout the test 18 and running the procedure for longer in areas of interest (while perhaps decreasing the number of presentations required elsewhere). New psychophysical stimuli may confer the added advantage of assessing alternate aspects of visual function that are damaged either earlier or later in the disease process relative to contrast detection of size III targets, hence assisting in the monitoring of both early- and late-stage disease progression. 
There are myriad clinical scenarios; hence, the results presented herein can reflect only a subset of the individual situations encountered in practice. The simulations portray on-average population performance for the specific rates of change and follow-up timings included in our simulations. Figures 3, 4, and 5 highlight the levels of error reduction necessary to improve detection by 1 year as a point of comparison between the different test conditions. We do not mean to suggest that only a 12-month improvement is meaningful; clearly the level of benefit depends on the context. A 6-month reduction in the duration of a large clinical trial is likely to be important, and for a patient tested three times a year, detecting progression 4 months (one test visit) earlier is meaningful. Results for many of these scenarios are included in Figures 3 and 4, but space limitations prevent extensive discussion here. The specific benefit of reducing error for a given patient in a clinical context will necessarily depend on the patient's rate of change, how often the patient is reviewed clinically, and whether the patient's responses are more or less variable than the error models presented within the simulations. 
Computer simulation has both advantages and disadvantages compared with the analysis of real clinical data. The benefit of computer simulation is that the observer's true underlying threshold is known, the rate of true visual field deterioration is known, and thousands of tests can be conducted to enable a robust appreciation of the distribution of errors. In contrast, clinical test–retest data are limited: estimates of test–retest performance must be determined from a very small set of repeats per observer, and the actual underlying true threshold is not known. However, the accuracy of computer simulation relies on the accuracy of the simulated model relative to real performance, and on the validity of any related assumptions regarding performance. In this article, we deliberately kept the false-positive and false-negative rates low (1%), but we recognize that clinical catch-trial estimates of these parameters are often substantially higher. We strongly believe that true false-positive rates can be quite low with adequate training and reinstruction of participants early in the visual field test. False-negative rates measured with catch trials are confounded with probabilistic negative responses resulting from shallow FoS curves; this type of measured "false negative" is included in our model as an FoS curve that becomes shallower with decreasing sensitivity. We also capped the standard deviation of the Gaussian model of variability at 6 dB. This level is less than that reported for clinically measured FoS curves 14 ; however, data are scarce for areas of low sensitivity, and modeling the FoS there is difficult due to floor effects; hence, we chose a conservative error model. In any event, because our assumed model of patient variability is conservative, the reductions in SAP variability we have outlined are lower bounds on those that would be required for more variable patients. 
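The response model just described (a cumulative-Gaussian FoS mixed with 1% false-positive and false-negative rates, as in Figure 1) can be sketched as follows; the 6 dB FoS spread shown is the cap used in the simulations, and the function names are our own.

```python
import math
import random

rng = random.Random(7)

def p_seen(stim_db, thresh_db, fos_sd=6.0, fp=0.01, fn=0.01):
    # Frequency-of-seeing curve: probability of reporting a stimulus of
    # stim_db as seen, given true threshold thresh_db. A cumulative
    # Gaussian (SD fos_sd) is mixed with false-positive rate fp and
    # false-negative rate fn, so p ranges from fp up to 1 - fn.
    phi = 0.5 * (1.0 + math.erf((thresh_db - stim_db)
                                / (fos_sd * math.sqrt(2.0))))
    return fp + (1.0 - fp - fn) * phi

def respond(stim_db, thresh_db):
    # One simulated patient response: True = "seen", False = "not seen"
    return rng.random() < p_seen(stim_db, thresh_db)
```

With fp = fn, the curve passes through exactly 0.5 when the stimulus equals the true threshold; a shallower FoS (larger fos_sd) raises the chance of "not seen" responses to suprathreshold stimuli, which is how catch-trial "false negatives" arise in the model without a separate mechanism.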
It is also interesting to reverse the problem and ask: given a target number of years to detect progression, what error characteristics are tolerable in a perimetric procedure? The analytical framework demonstrated in this article provides guidelines and goals for the improvement required of new procedures, and can assist in cost–benefit assessment of new clinical techniques before lengthy and expensive clinical trials are undertaken. 
Footnotes
 Supported by Australian Research Council Grants FT0991326 (AT) and FT0990930 (AMM).
 Disclosure: A. Turpin, None; A.M. McKendrick, None
References
Artes PH Iwase A Ohno Y Kitazawa Y Chauhan BC . Properties of Perimetric Threshold Estimates from Full Threshold, SITA Standard, and SITA Fast strategies. Invest Ophthalmol Vis Sci. 2002;43:2654–2659. [PubMed]
Wild JM Pacey IE Hancock SA Cunliffe IA . Between-algorithm, between-individual, differences in normal perimetric sensitivity: Full Threshold, FASTPAC, and SITA. Invest Ophthalmol Vis Sci. 1999;40:1152–1161. [PubMed]
Kim J Dally LG Ederer F . The Advanced Glaucoma Intervention Study (AGIS): 14, distinguishing progression of glaucoma from visual field fluctuations. Ophthalmology. 2004;111:2109–2116. [CrossRef] [PubMed]
Keltner JL Johnson CA Quigg JM Cello KE Kass MA Gordon MO . Confirmation of visual field abnormalities in the Ocular Hypertension Treatment Study. Ocular Hypertension Treatment Study Group. Arch Ophthalmol. 2000;118:1187–1194. [CrossRef] [PubMed]
Heijl A Leske MC Bengtsson B Hussein M ; Early Manifest Glaucoma Trial Group . Measuring visual field progression in the Early Manifest Glaucoma Trial. Acta Ophthalmol Scand. 2003;81:286–293. [CrossRef] [PubMed]
Vesti E Johnson CA Chauhan BC . Comparison of different methods of detecting glaucomatous visual field progression. Invest Ophthalmol Vis Sci. 2003;44:3873–3879. [CrossRef] [PubMed]
McKendrick AM Turpin A . Combining perimetric supra-threshold and threshold procedures to reduce measurement variability in areas of visual field loss. Optom Vis Sci. 2005;82:43–51. [CrossRef] [PubMed]
Turpin A McKendrick AM . Retesting visual fields: utilizing prior information to decrease test–retest variability in glaucoma. Invest Ophthalmol Vis Sci. 2007;48:1627–1634. [CrossRef] [PubMed]
Bengtsson B Heijl A . Evaluation of a new perimetric strategy, SITA, in patients with manifest and suspect glaucoma. Acta Ophthalmol Scand. 1998;76:368–375.
Bengtsson B Olsson J Heijl A Rootzen H . A new generation of algorithms for computerized threshold perimetry. Acta Ophthalmol Scand. 1997;75:368–375. [CrossRef] [PubMed]
Schiefer U Pascual JP Edmunds B . Comparison of the new perimetric GATE strategy with conventional Full Threshold and SITA Standard Strategies. Invest Ophthalmol Vis Sci. 2009;50:488–494. [CrossRef] [PubMed]
Morales J Weitzman ML Gonzalez de la Rosa M . Comparison between Tendency-Oriented Perimetry (TOP) and octopus threshold perimetry. Ophthalmology. 2000;107:134–142. [CrossRef] [PubMed]
Kutzko KE Brito CF Wall M . Effect of instructions on conventional automated perimetry. Invest Ophthalmol Vis Sci. 2000;41:2006–2013. [PubMed]
Henson DB Chaudry S Artes PH Faragher EB Ansons A . Response variability in the visual field: comparison of optic neuritis, glaucoma, ocular hypertension and normal eyes. Invest Ophthalmol Vis Sci. 2000;41:417–421. [PubMed]
Wall M Kutzko KE Chauhan BC . Variability in patients with glaucomatous visual field damage is reduced using size V stimuli. Invest Ophthalmol Vis Sci. 1997;38:426–435. [PubMed]
Spry PGD Johnson CA McKendrick AM Turpin A . Variability components of standard automated perimetry and frequency doubling technology perimetry. Invest Ophthalmol Vis Sci. 2001;42:1404–1410. [PubMed]
Artes PH Hutchison DM Nicolela MT LeBlanc RP Chauhan BC . Thresholds and variability properties of Matrix Frequency-Doubling Technology and Standard Automated Perimetry in glaucoma. Invest Ophthalmol Vis Sci. 2005;46:2451–2457.
Hot A Dul MW Swanson WH . Development and evaluation of a contrast sensitivity perimetry test for patients with glaucoma. Invest Ophthalmol Vis Sci. 2008;49:3049–3057. [CrossRef] [PubMed]
Miranda MA Henson DB . Perimetric sensitivity and response variability in glaucoma with single-stimulus automated perimetry and multiple-stimulus perimetry with verbal feedback. Acta Ophthalmol. 2008;86:202–206. [CrossRef] [PubMed]
Bengtsson B Heijl A Olsson J . Evaluation of a new threshold visual field strategy, SITA, in normal subjects. Acta Ophthalmol Scand. 1998;76:165–169. [CrossRef] [PubMed]
Wild JM Pacey IE O'Neill EC Cunliffe IA . The SITA perimetric threshold algorithms in glaucoma. Invest Ophthalmol Vis Sci. 1999;40:1998–2009.
Anderson DR Patella VM . Automated Static Perimetry. 2nd ed. St Louis: Mosby; 1999.
Turpin A McKendrick AM . Observer-based rather than population-based confidence limits for determining probability of change in visual fields. Vision Res. 2005;45:3277–3289. [CrossRef] [PubMed]
Treutwein B . Adaptive psychophysical procedures. Vision Res. 1995;35:2503–2522. [CrossRef] [PubMed]
Turpin A McKendrick AM Johnson CA Vingrys AJ . Properties of perimetric threshold estimates from Full Threshold, ZEST, and SITA-like strategies, as determined by computer simulation. Invest Ophthalmol Vis Sci. 2003;44:4787–4795. [CrossRef] [PubMed]
Gardiner SK Crabb DP . Examination of different pointwise linear regression methods for determining visual field progression. Invest Ophthalmol Vis Sci. 2002;43:1400–1407. [PubMed]
Spry PGD Johnson CA Bates A Turpin A Chauhan BC . Spatial and temporal processing of threshold data for detection of progressive glaucomatous visual field loss. Arch Ophthalmol. 2002;120:173–180. [CrossRef] [PubMed]
King-Smith P Grigsby S Vingrys A Benes S Supowit A . Efficient and unbiased modifications of the QUEST threshold method: theory, simulations, experimental evaluation, and practical implementation. Vision Res. 1994;34:885–912. [CrossRef] [PubMed]
Turpin A McKendrick AM Johnson CA Vingrys AJ . Development of efficient threshold strategies for Frequency Doubling Technology perimetry using computer simulation. Invest Ophthalmol Vis Sci. 2002;43:322–331. [PubMed]
Figure 1.
 
Probability of obtaining a measurement using FT (averaged over all starting points) when the patient has a true sensitivity of 4 dB (left) and 10 dB (right) assuming 1% false-positive and -negative rates, and a slope of the FoS curve as 6 dB SD of a cumulative Gaussian function. Bars: actual probabilities; solid line: best-fit Gaussian.
Figure 2.
 
Slope cutoff criteria for calling progression using PLR derived by sampling 10,000 sequences of a subject stable at 30 or 20 dB and taking the lower 5% quantile of the collected slopes (hence keeping specificity fixed at 95%).
Figure 3.
 
Proportion of progressing locations identified, where each point begins at 30 dB and deteriorates as per the label in the bottom right corner of each panel for 6 years. The number of fields measured per year also appears in the bottom right of each panel. (■) Current SAP procedures; points designated x represent a reduction factor (1 − x × 10%) in the standard deviation for the procedure (i.e., 8 is 80% of the SD of current procedures or a 20% reduction in error). Lines are added for clarity. Shaded region: 1 year relative to current procedures (■), hence symbols appearing to the left of the region indicate that progression is determined more than 1 year earlier than current procedures.
Figure 4.
 
As in Figure 3, but with each location beginning at 20 dB.
Figure 5.
 
The lighter bars show the number of years needed to reach a sensitivity of 80% (left) and 90% (right) using criteria on the slope of linear regression on MD, such that specificity is fixed at 95%. The number under each bar is the number of locations (of 52) that had a true threshold of 30 dB that decreased by either 1 dB per year (A, B) or 2 dB per year (C, D); the remainder of the true field was stable at 30 dB. (A) and (C) give results if visits are twice per year, whereas (B) and (D) show the results for three visits per year. The darker bars show the reduction in the number of years to obtain 80% or 90% sensitivity that would accrue if a test with 40% less error (60% of current) were used. Bars that were truncated at 6 dB have upward arrows at the top.
Figure 6.
 
The standard deviations of the Gaussian used to model procedure error in this study (■ and 6 curves), in addition to three new procedures. Constant FoS (○) assumes a constant FoS spread of two for all levels of stimuli; ZEST 9 (▾) and ZEST 20 (▴) are ZEST procedures with a uniform prior, terminating when the standard deviation of the pdf (probability density function) falls below 1.7 (average, 9 presentations) and 0.5 (average, 20 presentations), respectively.