**Purpose.**
The test–retest variability of standard automated perimetry (SAP) severely limits its ability to detect sensitivity decline. Numerous improvements in procedures have been proposed, but assessment of their benefits requires quantification of how much variability reduction results in meaningful benefit. This article determines how much reduction in SAP procedure variability is necessary to permit earlier detection of visual field deterioration.

**Method.**
Computer simulation and statistical analysis were used. Gaussian distributions were fit to the probability of observing any sensitivity measurement obtained with SAP and the Full Threshold algorithm to model current variability. The standard deviation of these Gaussians was systematically reduced to model a reduction of SAP variability. Progression detection ability was assessed by using pointwise linear regression on decreases of −1 and −2 dB/year from 20 and 30 dB, with a custom criterion that fixed detection specificity at 95%. Test visits occurring twice and thrice per annum were modeled, and analysis was performed on single locations and whole fields.

**Results.**
A 30% to 60% reduction in SAP variability was required to detect pointwise deterioration 1 year earlier than current methods, depending on progression rate and visit frequency. A reduction of 20% in variability generally allowed progression to be detected one visit earlier.

**Conclusions.**
On average, the variability of SAP procedures must be reduced by approximately 20% for a clinically appreciable improvement in detection of visual field change. Analysis similar to that demonstrated can measure the improvement required of new procedures, assisting in cost–benefit assessment for the adoption of new techniques, before lengthy and expensive clinical trials.

^{ 1,2 }High variability makes the detection of visual field change difficult, with progressive conditions such as glaucoma requiring large sequences of test data from many patient visits to detect clinical change.^{ 3 –6 }

^{ 7,8 }or to facilitate rapid completion while maintaining traditional test–retest variability.^{ 9 –12 } As the instructions given to patients can significantly affect perimetric outcomes,^{ 13 } another approach might be to provide more consistent and informative instructions to patients, combined with more effective training of staff involved in administering visual field tests.

^{ 14 }An alternate proposed method for reducing perimetric variability is to alter the test stimuli in a manner that results in steeper psychometric functions. For example, increasing from a size III to a size V target results in steeper FoS curves.^{ 15 } Steeper FoS curves have also been proposed as an advantage of using the low spatial, high temporal frequency stimuli of the frequency-doubling perimeter; however, the comparison is not straightforward because the test dynamic range is restricted relative to SAP.^{ 16,17 } Keeping test–retest variability consistent across the test stimulus range has also been the goal of new stimuli proposed by Hot et al.^{ 18 } FoS curves can also be manipulated by altering the attentional demands of the task.^{ 19 }

^{ 10 }Clinical studies demonstrate qualitatively similar threshold estimates and test–retest sensitivity between FT and SITA,^{ 1,2,9,20,21 } with SITA thresholds being, on average, approximately 1 dB higher. Using this model, we examined the ability to detect linear decreases in visual field estimates using pointwise linear regression (PLR) with a fixed specificity of 95% when the standard deviations of the test procedure error distributions were systematically reduced. We also examined the ability to detect linear decreases in Mean Defect scores for whole fields with a varying number of linearly deteriorating visual field locations. Finally, we examined how tests might be modified to achieve error reductions capable of detecting visual field deterioration 1 year earlier than current methods. We now describe each method in detail.

^{ 22 }as the base procedure for our work. It is a staircase procedure that modifies stimulus luminance in steps of 4 dB until the first response reversal occurs and subsequently in steps of 2 dB. The HFA implementation of FT terminates after two response reversals and takes the stimulus luminance of the last-seen presentation as the final sensitivity estimate for a given location. In our implementation of FT, if the first estimate was more than 4 dB away from the starting point of the staircase, then the procedure was repeated using the first sensitivity estimate as the starting point, and the result of this second staircase was taken as the final estimate of sensitivity.
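The staircase rules described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HFA code or the implementation used in this study: the function names, the `sees` callback that stands in for the patient, and the rule for the 40 dB ceiling are all hypothetical. The 0 dB floor rule (two unseen presentations at 0 dB end the staircase) follows the behaviour described elsewhere in the text.

```python
def ft_staircase(start, sees, lo=0, hi=40):
    """One 4-2 dB staircase. `sees(x)` returns True if stimulus x (dB) is seen."""
    x, step = start, 4
    reversals = 0
    prev_seen = None
    last_seen = None
    edge_hits = 0  # consecutive same-direction responses at a range limit
    while True:
        seen = sees(x)
        if seen:
            last_seen = x
        if prev_seen is not None and seen != prev_seen:
            reversals += 1
            step = 2                      # 2 dB steps after the first reversal
            if reversals == 2:
                return last_seen          # last-seen level is the estimate
        prev_seen = seen
        # Floor rule from the text; the ceiling rule is an assumption.
        if (x == lo and not seen) or (x == hi and seen):
            edge_hits += 1
            if edge_hits == 2:
                return x
        else:
            edge_hits = 0
        # Seen -> dimmer stimulus (higher dB); not seen -> brighter (lower dB).
        x = min(hi, max(lo, x + step if seen else x - step))


def full_threshold(start, sees):
    """Repeat the staircase from the first estimate if it moved more than 4 dB."""
    first = ft_staircase(start, sees)
    if abs(first - start) > 4:
        return ft_staircase(first, sees)
    return first
```

With an idealised observer who sees every stimulus at or below 27 dB, `full_threshold(25, lambda x: x <= 27)` runs a single staircase, whereas a start point of 10 dB triggers the second staircase because the first estimate lies more than 4 dB from the start.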

^{ 23 }The leaves of the tree represent possible outcomes from the procedure with an associated probability. An example of two probability distributions derived in this study is shown in Figure 1.

^{ 24 }where *fp* is the false-positive rate defining the lower asymptote of Ψ; *fn* is the false-negative rate defining the upper asymptote of Ψ; *s* is the standard deviation of a cumulative Gaussian defining the spread of Ψ; *t* is the threshold, or translation of Ψ, along the abscissa; and *G*(*x*, *t*, *s*) is the value at *x* of a cumulative Gaussian distribution with mean *t* and SD *s*. For these experiments, we assumed that the patient would make few errors, and so we set the false-positive and -negative rates to 1%. With reducing visual field sensitivity, the psychometric function slope flattens for size III SAP targets.^{ 14,15 } Our model accounts for this dependence of variability on sensitivity by varying the standard deviation of the Gaussian with sensitivity by using *s* = exp(−0.066 × *t* + 2.81), as previously reported for clinical data,^{ 14 } capped at a maximum of 6 dB. In the two examples shown in Figure 1, we assumed that all starting points for the FT procedure are equally likely and summed and normalized the 41 possible distributions (start points of 0,1… 40 dB) into the ones shown.
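As an illustration of the response model, the following Python sketch evaluates the spread formula and a frequency-of-seeing curve with 1% false-positive and false-negative rates. The spread formula and the 6 dB cap are taken from the text; the functional form used for Ψ, fp + (1 − fp − fn) × (1 − *G*(*x*, *t*, *s*)), is an assumption chosen only to be consistent with the symbol definitions above, and the function names are hypothetical.

```python
import math


def fos_spread(t, cap=6.0):
    """Spread s (dB) at true sensitivity t (dB): exp(-0.066*t + 2.81), capped."""
    return min(math.exp(-0.066 * t + 2.81), cap)


def cumulative_gaussian(x, mean, sd):
    """G(x, mean, sd): cumulative Gaussian evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def prob_seen(x, t, fp=0.01, fn=0.01):
    """ASSUMED form of the FoS curve: probability of seeing stimulus x (dB)."""
    s = fos_spread(t)
    return fp + (1.0 - fp - fn) * (1.0 - cumulative_gaussian(x, t, s))
```

Under this form the curve passes through 0.5 at *x* = *t*, approaches 1 − fn for bright stimuli (low dB), and approaches fp for dim stimuli (high dB). The cap on the spread applies for true sensitivities below roughly 15 dB.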

*t* from 0 to 40 dB, then for each distribution we fit a Gaussian, *G*(0… 40, *m*_{t}, *s*_{t}), using the nonlinear minimization (nlm) function in R to minimize the *L*_{1} norm (see Fig. 1 for examples where *t* = 4 and *t* = 10 dB). Because we assumed a high variability in patient response (flat FoS curve) when true sensitivity was low, there was a “floor effect” where FT returned 0 dB very often. This effect can be seen in Figure 1, left, where, although the true sensitivity is 4 dB, 0 dB occurs nearly as often as 3 dB. This situation arises when a patient does not see 0 dB twice; then, FT terminates and returns a sensitivity of 0 dB, resulting in multimodal distributions for low true sensitivities, as in Figure 1, left, making the Gaussian fit poor. This phenomenon has been widely reported in clinical studies of the test–retest distribution of SAP thresholds,^{ 1,2,16 } where the distribution of retest values is skewed to lower sensitivities, particularly for low test sensitivities. We note in passing that if we sample test and retest values from our fitted Gaussian distributions (raising any observed negative values to 0), we get very similar skew distributions, and so the use of symmetrical Gaussians to model measured–given–true sensitivities is not inconsistent with the observed asymmetrical test–retest distributions reported in the literature. Examples of such distributions are shown in the Results section.
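The fitting step can be illustrated with a short Python sketch. It is not the R `nlm` call used in the study: a coarse grid search is swapped in for the nonlinear minimizer so that the example needs only the standard library, and the Gaussian is normalized over the integer grid 0 to 40 dB, which is an assumption. Function names, grid limits, and the step size are hypothetical.

```python
import math

DOMAIN = range(0, 41)  # possible outcomes, 0..40 dB


def gaussian_pmf(m, s):
    """Gaussian with mean m and SD s, evaluated on DOMAIN and normalized."""
    weights = [math.exp(-0.5 * ((x - m) / s) ** 2) for x in DOMAIN]
    total = sum(weights)
    return [w / total for w in weights]


def l1_distance(p, q):
    """L1 norm of the difference between two discrete distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


def fit_gaussian_l1(observed, step=0.25):
    """Grid search (stand-in for nlm) for the (m, s) minimizing the L1 norm."""
    best = None
    for i in range(int(40 / step) + 1):
        m = i * step
        for j in range(int(7.5 / step) + 1):
            s = 0.5 + j * step          # SD searched over 0.5..8 dB
            loss = l1_distance(observed, gaussian_pmf(m, s))
            if best is None or loss < best[0]:
                best = (loss, m, s)
    return best[1], best[2]
```

Passing a distribution that is itself a discretized Gaussian, for example `gaussian_pmf(25.0, 2.0)`, returns the generating parameters. A multimodal distribution with a spike at 0 dB would instead yield a compromise fit, which is the poor-fit situation described above.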

^{ 1,25 }We “recenter” the distributions by setting the mean to the true sensitivity value of the patient. Thus, the distribution of possible outcomes from FT for true sensitivity *t* is described by *G*(0… 40, *t*, *s*_{t}).

*t*, *s*_{t}), we get a measured sensitivity value that is typical of current SAP procedures. As the purpose of this work was to investigate the benefits of improving SAP variability, we simulated improved procedures by systematically reducing *s*_{t} in steps of 10%, sampling from *G*(0… 40, *t*, 90% × *s*_{t}), *G*(0… 40, *t*, 80% × *s*_{t}), and so on. We also examined the ability to detect progression if the FoS slopes were consistent across the range of available sensitivity estimates, that is, if there was no flattening of psychometric function slope with decreasing sensitivity. Avoiding a flattening of FoS with sensitivity is the goal of several research groups exploring alternate stimulus types to current size III SAP targets.^{ 15,18 }
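A minimal Python sketch of this sampling step is given below. It assumes that measurements are drawn from a Gaussian centred on the true sensitivity and then limited to the 0 to 40 dB range; raising negative values to 0 is mentioned in the text, whereas the upper limit and the function name `measure` are assumptions. The standard deviation of 1.65 dB used in the example is the value quoted in the text for a 30 dB baseline.

```python
import random
import statistics


def measure(true_db, sd_current, reduction=0.0, rng=random):
    """Draw one measured sensitivity (dB) from an 'improved' procedure.

    reduction=0.1 corresponds to sampling with 90% of the current SD,
    reduction=0.2 to 80%, and so on.
    """
    sd = (1.0 - reduction) * sd_current
    value = rng.gauss(true_db, sd)
    return min(40.0, max(0.0, value))  # upper clamp is an assumption


# Example: 40% reduction in variability at a true sensitivity of 30 dB.
rng = random.Random(0)
samples = [measure(30.0, 1.65, reduction=0.4, rng=rng) for _ in range(20000)]
observed_sd = statistics.stdev(samples)  # close to 0.6 * 1.65 = 0.99 dB
```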

*P* < 0.01).

^{ 26 }All of these criteria are based on clinical data collected using current SAP procedures. It is likely that these existing criteria will be too specific and conservative for determining change for procedures that are modeled to have significantly less error than current techniques.

*G*(0… 40, *t*, *s*_{t}), and the true baseline sensitivity for any location was 30 dB. Thus, it was possible to derive precise 95% confidence intervals for the slope of a regression line on a sequence of measurements that did not change. Repeatedly generating sequences of measurements of various lengths (numbers of visits) drawn from the Gaussian distribution *G*(0… 40, 30, *s*_{30} = 1.65) for a baseline of 30 dB, or *G*(0… 40, 20, *s*_{20} = 3.17) for a baseline of 20 dB, gives the distribution of the slope of a regression line through these sequences. Then, for each number of visits (sequence length) we can take the lower 5% quantile slope value as a cutoff point for determining change with a specificity of 95%. The number of visits per year, assuming we wish to express the cutoff slope in decibels per year, must also be included in the repeated regressions as a scaling of the independent variable (time). Figure 2 shows the values used in this study.
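The derivation of the cutoff slope can be sketched by simulation in Python. The sketch assumes ordinary least-squares regression of sensitivity on time in years, stable (non-progressing) sequences drawn from a Gaussian limited to 0 to 40 dB, and an empirical 5% quantile; the function names, the number of simulated sequences, and the seed are hypothetical and are not the values used to produce Figure 2.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def cutoff_slope(n_visits, visits_per_year, baseline, sd, n_sims=5000, seed=1):
    """Lower 5% quantile of regression slopes (dB/year) for a stable location."""
    rng = random.Random(seed)
    # Visit times in years: this is the scaling of the independent variable.
    times = [i / visits_per_year for i in range(n_visits)]
    slopes = []
    for _ in range(n_sims):
        ys = [min(40.0, max(0.0, rng.gauss(baseline, sd))) for _ in times]
        slopes.append(ols_slope(times, ys))
    slopes.sort()
    return slopes[int(0.05 * n_sims)]


# Baseline 30 dB (SD 1.65 dB), two visits per year.
cut_6_visits = cutoff_slope(6, 2, 30.0, 1.65)
cut_10_visits = cutoff_slope(10, 2, 30.0, 1.65)
```

A measured slope more negative than the cutoff for the relevant sequence length would be flagged as progression. The cutoff moves toward zero as visits accumulate because the slope estimate of a stable series becomes more precise.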

^{ 26,27 }When the number of visits per year was increased to 3, the sensitivity for detecting a change of −1 dB/year increased to 89% (bottom left panel) and to 100% for −2 dB/year.

*x*-axes of each panel) are linearly decreasing in sensitivity from 30 dB. Figures 5A and 5B show results for progression of 1 dB/year for two visits and three visits per year, respectively. The left-hand panels provide results for 80% sensitivity to call progression, whereas the right-hand panels show 90% sensitivity. Figures 5C and 5D are similarly formatted, but for the more rapidly progressing case (2 dB/year). In each panel, the light bars show the time (number of years) taken to reach the criterion sensitivity for detecting progression (80% or 90%). As expected, this time decreased as the number of progressing locations increased (1–10), and the criterion was reached more rapidly for 80% sensitivity than for 90% sensitivity. The dark bars show the time improvement gained if the procedure error is reduced by 40%. We explored the improvement for a range of different error reductions (as per Figs. 3, 4) but show here the 40% condition as it improved the ability to detect progression by approximately 1 year on average.

^{ 15,17 }). To investigate the potential benefits of these approaches, we used our model of perimetric variability to calculate the error distribution of several alternate theoretical procedures. The method was the same as described above for determining the error distribution for the FT procedure, substituting either a different test algorithm or FoS curve. Figure 6 shows the resultant standard deviation of the Gaussian fitted to the error distributions for the following procedures:

- A ZEST procedure (a Bayesian adaptive procedure used in the Humphrey Matrix Perimeter [Carl Zeiss Meditec] and described fully elsewhere^{ 28,29 }). The procedure has been shown to return less variable thresholds on average when run for a comparable number of presentations as FT. We tried several ZEST procedures and show here an implementation that reduced perimetric variability by approximately 40%. Note that, to achieve this reduction in variability, ZEST was seeded with a uniform prior distribution and required an average of 20 presentations to terminate (termination criterion: standard deviation of the posterior distribution less than 0.5 dB). Figure 6 includes a version of the same ZEST that required nine presentations on average to terminate (termination criterion: standard deviation of the posterior distribution less than 1.7 dB). These procedures are included to be indicative of the types of performance gains expected by improving test algorithms for SAP simply by asking more questions of the observer to improve the reliability of estimates.
- The FT procedure using a constant FoS spread of 2 dB across the entire stimulus range tested. This model is included to represent a theoretical new procedure for which psychometric function slope is constant with decreasing sensitivity.
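To make the ZEST description concrete, the sketch below shows a generic Bayesian staircase of this kind in Python: a uniform prior over 0 to 40 dB, each stimulus presented at the mean of the current distribution, and termination once the posterior standard deviation falls below a criterion. It is not the implementation evaluated in Figure 6. The likelihood function (its spread of 1 dB and 3% error rates), the presentation cap, and all function names are assumptions made for illustration.

```python
import math


def cumulative_gaussian(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def likelihood_seen(x, t, s=1.0, fp=0.03, fn=0.03):
    """ASSUMED likelihood: P(seen | stimulus x, true threshold t)."""
    return fp + (1.0 - fp - fn) * (1.0 - cumulative_gaussian(x, t, s))


def zest(sees, stop_sd=1.7, max_presentations=100):
    """Return (threshold estimate in dB, number of presentations)."""
    domain = list(range(0, 41))
    pdf = [1.0 / len(domain)] * len(domain)       # uniform prior
    presentations = 0
    while True:
        mean = sum(t * p for t, p in zip(domain, pdf))
        sd = math.sqrt(sum(p * (t - mean) ** 2 for t, p in zip(domain, pdf)))
        if sd < stop_sd or presentations >= max_presentations:
            return mean, presentations
        seen = sees(mean)                         # present at the pdf mean
        pdf = [
            p * (likelihood_seen(mean, t) if seen else 1.0 - likelihood_seen(mean, t))
            for t, p in zip(domain, pdf)
        ]
        total = sum(pdf)
        pdf = [p / total for p in pdf]            # renormalize the posterior
        presentations += 1


# Idealised observer who sees every stimulus at or below 25 dB.
est_loose, n_loose = zest(lambda x: x <= 25, stop_sd=1.7)
est_strict, n_strict = zest(lambda x: x <= 25, stop_sd=0.5)
```

The stricter 0.5 dB criterion can never terminate earlier than the 1.7 dB criterion for the same observer, which mirrors the trade-off noted above between the number of presentations and the variability of the estimate.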

*x*-axis, with the dashed lines showing test–retest data from Figure 4b of Artes et al.^{ 17 } The experimental results were generated assuming a Gaussian model with standard deviations given by 80% × exp(−0.06 × (0… 35) + 2.4), which is 80% of the best fit of FoS for normals.^{ 14 } This was chosen so that the 95% confidence intervals roughly equalled those reported previously,^{ 17 } which illustrates that test–retest information similar to that encountered clinically can be generated by our Gaussian model of error (Fig. 7, leftmost panel). The “patient” testing methodology used to generate the results is the same as that of Artes et al.^{ 17 }: each model patient generated six visual fields, and all combinations of pairs were taken as baseline and follow-up in turn. The true thresholds of the patient population had 200 thresholds of each value from 0 to 35 dB. The middle and right-hand panels illustrate the predicted test–retest distribution with a 20% and 40% reduction in error, respectively.
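The generation of these test–retest data can be sketched in Python as follows. The standard deviation formula, the six fields per model patient, the 200 thresholds per level, and the pairing of every field with every other as baseline and follow-up come from the text. Raising negative values to 0, leaving measurements unrounded, treating pairs as ordered, and the function names are assumptions.

```python
import itertools
import math
import random


def retest_sd(t, scale=0.8):
    """SD (dB) of measurements at true threshold t: scale * exp(-0.06*t + 2.4)."""
    return scale * math.exp(-0.06 * t + 2.4)


def simulate_test_retest(n_per_level=200, n_fields=6, scale=0.8, seed=0):
    """Return (baseline, follow-up) pairs for true thresholds 0..35 dB."""
    rng = random.Random(seed)
    pairs = []
    for t in range(0, 36):
        sd = retest_sd(t, scale)
        for _ in range(n_per_level):
            # Six repeated measurements of the same true threshold.
            fields = [max(0.0, rng.gauss(t, sd)) for _ in range(n_fields)]
            # Every ordered pair serves as baseline and follow-up in turn.
            pairs.extend(itertools.permutations(fields, 2))
    return pairs


pairs_small = simulate_test_retest(n_per_level=5)
```

Under these assumptions, a further 20% or 40% reduction in error corresponds to calling the function with `scale=0.8 * 0.8` or `scale=0.8 * 0.6`.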

^{ 7,8 }

^{ 16,17 }In terms of a decibel scale, FDT has a significantly reduced dynamic range relative to SAP, creating a numerical floor in terms of contrast scaling. A simple comparison of decibel scaling is not a direct correlate of the ability of these stimulus types to detect progression or to quantify more advanced visual field loss, however, due to marked differences in the spatial extent of the stimuli and their different temporal properties.

^{ 15 }), combined with altering the procedure to maintain attentional demands throughout the test^{ 18 } and running the procedure for longer in areas of interest (while perhaps decreasing the number of presentations required elsewhere). New psychophysical stimuli may confer the added advantage of assessing alternate aspects of visual function that are damaged either earlier or later in the disease process relative to contrast detection of size III targets, hence assisting in the monitoring of both early- and late-stage disease progression.

^{ 14 }; however, there is a scarcity of data for areas of low sensitivity, and modeling the FoS is difficult due to floor effects; hence, we chose to incorporate a conservative error model. In any event, because our assumed model of patient variability is conservative, the reductions in SAP variability we have outlined are lower bounds on those that would actually be required with more variable patients.