purpose. To investigate the accuracy and precision of threshold estimates returned by two Bayesian perimetric strategies, staircase-QUEST or SQ (a Swedish interactive threshold algorithm [SITA]-like strategy) and ZEST (zippy estimation by sequential testing), and to compare these measures with those of the full-threshold (FT) algorithm.

methods. A computerized visual field simulation model was developed to compare the performance (accuracy, precision, and number of presentations) of the three algorithms. SQ implemented aspects of the SITA algorithm that are in the public domain. The simulation was tested by using standard automated perimetry (SAP) visual field data from 265 normal subjects and 163 observers with glaucomatous visual field loss and by exploring the effect of response variability and response errors on algorithm performance.

results. SQ was faster than FT or ZEST, with a comparable mean error when simulating field tests on patients. Pointwise analysis revealed similar error and standard deviation of error as a function of threshold for FT and SQ. If the initial estimate of threshold for either procedure was incorrect, the mean and standard deviation of the error increased markedly. ZEST produced more accurate thresholds than the other two strategies when the initial estimate was remote from the true threshold.

conclusions. When simulated patients made response errors, the accuracy and precision of sensitivity estimates were poor whenever the initial estimate of threshold either overestimated or underestimated the true threshold. This was particularly so for FT and SQ. ZEST demonstrated more consistent error properties than the other two strategies.


SITA Standard reduces the test time for assessment of the central 30° of the visual field by up to 50% compared with the test times required by the FT strategy.^{5,6} The reduction in test duration is achieved in several ways^{4,7}: (1) more efficient threshold estimation, based on maximum-likelihood principles, results in a reduced number of presentations; (2) false-positive responses are estimated without the use of catch trials; (3) the interstimulus interval is altered to match the patient’s speed of response; and (4) SITA repeats testing if the threshold returned is more than 12 dB from the initial estimate of threshold, whereas FT repeats if the threshold is more than 4 dB from the initial estimate.


ZEST has been shown to determine thresholds efficiently for frequency-doubling technology (FDT) perimetry^{10,11} and is available commercially for SAP in the Medmont perimeter (Medmont Pty. Ltd., Camberwell, Victoria, Australia) and in the Humphrey Matrix, a new FDT perimeter. Because it is based on maximum-likelihood principles, ZEST shares some features with SITA but is computationally simpler.


SITA Standard has also been shown to have lower global test-retest variability than FT.^{14,15,16} However, newer strategies are computationally more complicated than first-generation staircase strategies, and a full understanding of their performance may not be revealed by such global comparisons. This is evidenced by a recent study by Artes et al.,^{16} which provides a detailed examination of differences between FT and SITA strategies and reveals that the differences in threshold estimates returned by these procedures vary with threshold in a nonlinear manner.


This lack of precision means that many repeated tests are required to obtain a reliable threshold estimate, which has practical limitations when testing patients. Furthermore, it is impossible to evaluate the accuracy of the mean threshold estimate obtained from repeated testing, because a patient’s true threshold is not known. Hence, thresholds returned by FT are not an adequate standard against which to measure the accuracy of other strategies. Computer simulation of visual field assessment is the ideal tool for evaluating test performance and has been successfully applied to the study of perimetric algorithms.^{2,4,11}

The simulation model was similar to one described previously,^{11} except that the present investigation was applied to SAP. The simulation reads an input threshold and then applies a test procedure. In the simplest mode, the simulation assumes an observer without response variability, such that any stimulus presented at a lower luminance (higher decibel value) than the input threshold cannot be seen (a “no” response). Likewise, any stimulus presented at a higher luminance (lower decibel value) elicits a “yes” response. If the stimulus is presented at a luminance equal to the input threshold, a “yes” or “no” response is chosen with equal probability. The procedure is run to completion and a threshold estimate is output. The output threshold is compared with the input threshold to determine the error, and the number of presentations required is recorded.
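The no-variability response rule just described can be sketched as follows (a minimal illustration; the function name and the higher-dB-is-dimmer convention are ours, not taken from the authors' implementation):

```python
import random

def respond(stimulus_db, true_threshold_db, rng=random):
    """Simulated observer with no response variability.

    Higher dB = dimmer stimulus: anything dimmer than the input
    threshold is never seen, anything brighter is always seen, and a
    stimulus exactly at threshold is seen with probability 0.5.
    """
    if stimulus_db > true_threshold_db:   # dimmer than threshold
        return False                      # "no" response
    if stimulus_db < true_threshold_db:   # brighter than threshold
        return True                       # "yes" response
    return rng.random() < 0.5             # at threshold: coin flip
```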

Initial estimates of threshold were derived by using a growth pattern.^{17} With this approach, four seed locations have their thresholds estimated by using the mean sensitivity of 541 normal subjects as a starting value. These four locations are marked A in Figure 1, which shows a 24-2 stimulus presentation pattern in the format for a left eye. Once these four locations have been tested, their threshold values are used as the initial estimates for their immediate neighbors, the points labeled B in Figure 1. The remaining points derive their initial estimates by averaging those of their immediate neighbors that have already been tested. The averaging process is restricted so that it does not cross the horizontal midline, but it may cross the vertical midline. The simulation assumed that all A locations were fully determined before beginning any B locations. Similarly, all B locations were determined before commencing C locations, and all C locations were completed before commencing any of the locations labeled D.
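The neighbor-averaging step of this growth pattern might be sketched as below; the integer grid coordinates (rows positive above and negative below the horizontal midline) and the function name are our own illustrative assumptions:

```python
def initial_estimate(loc, done):
    """Mean threshold of the already-tested 8-neighbours of `loc`.

    `done` maps (col, row) -> measured threshold (dB).  Requiring
    matching row signs prevents averaging across the horizontal
    midline; crossing the vertical midline remains allowed.
    """
    col, row = loc
    vals = [done[(col + dc, row + dr)]
            for dc in (-1, 0, 1) for dr in (-1, 0, 1)
            if (dc, dr) != (0, 0)
            and (col + dc, row + dr) in done
            and (row > 0) == (row + dr > 0)]
    return sum(vals) / len(vals) if vals else None
```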

The FT algorithm consists of a staircase procedure that begins with 4-dB luminance steps until the first response reversal (seeing to nonseeing or vice versa).^{17} After the first response reversal, the step size is reduced to 2 dB. The procedure terminates after two reversals, and the threshold estimate is the last-seen intensity. If the difference between the measured threshold and the initial estimate is greater than 4 dB, a second staircase is initiated, with the current estimate used to derive its starting value.^{17} In cases in which a second staircase was initiated, our simulation reported the threshold estimate as the mean of the two staircase results. In clinical use on the Humphrey Field Analyzer (HFA), thresholds at a subset of locations are also determined twice to estimate short-term fluctuation.^{17} We did not implement these double determinations, because we determine precision by replicating the simulation many times. Hence, FT assessment using the HFA requires, on average, 50 to 60 more presentations per visual field than reported herein.
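A single 4-2 staircase of the kind just described can be sketched as follows (a simplified illustration: the 0-40 dB clipping bounds and the presentation cap are our assumptions, and the repeat-on-discrepancy logic is omitted):

```python
def staircase_4_2(start_db, respond, lo=0, hi=40, max_pres=30):
    """FT-style staircase: 4-dB steps until the first reversal, then
    2-dB steps; stop after the second reversal and return the
    last-seen intensity (None if nothing was seen) and the count."""
    level, step = start_db, 4
    last_seen, reversals, prev, n = None, 0, None, 0
    while reversals < 2 and n < max_pres:
        seen = respond(level)
        n += 1
        if seen:
            last_seen = level
        if prev is not None and seen != prev:
            reversals += 1
            step = 2                  # smaller steps after first reversal
        prev = seen
        # seen -> dimmer (higher dB); not seen -> brighter (lower dB)
        level = min(hi, max(lo, level + (step if seen else -step)))
    return last_seen, n
```

Note that for a deterministic observer the staircase tends to return a value at or just below the true threshold, consistent with the ~1-dB underestimation of FT discussed later in the text.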

The ZEST procedure is based on a maximum-likelihood determination described elsewhere.^{8,9} For each stimulus location, an initial probability density function (pdf) is defined that states, for each possible threshold, the probability that any patient will have that threshold (after adjusting for normal aging effects). We used the combined pdf approach recommended by Vingrys and Pianta,^{9} where the pdf is a weighted combination of normal and abnormal thresholds. The normal pdf gives a probability for each possible patient threshold, assuming that the location is “normal,” whereas the abnormal pdf gives probabilities assuming that the location is “abnormal.” Our normal and abnormal pdfs were derived from empiric data, as shown in Figures 2A and 2B. The patient set used to determine these pdfs consisted of 541 normal and 315 glaucomatous visual fields and was different from the input to the simulation. For each location, the lower 95th percentile of normal performance was determined from the 541 normal visual fields. The abnormal pdf was derived from the 315 patients with glaucoma by including only those thresholds that were below the lower 95th percentile for normal subjects. For both normal and abnormal pdfs, threshold estimates were pooled across all locations. For each test location, the normal pdf was shifted along the threshold axis so that its mode was at the initial estimate of threshold, and then the abnormal and normal pdfs were combined in a ratio of 1:4. A small nonzero pedestal was added to the normal pdf to ensure that all thresholds were represented with nonzero probability in the combined pdf. This is shown in Figure 2C for an initial estimate of 32 dB.
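The construction of this combined prior (normal pdf re-centered on the initial estimate, mixed 1:4 with the abnormal pdf, plus a small pedestal) might be sketched as below; the 0-40 dB domain, the pedestal height, and the use of a circular shift for the translation are our assumptions:

```python
import numpy as np

def combined_pdf(normal, abnormal, initial_estimate_db, pedestal=0.001):
    """1:4 (abnormal:normal) mixture prior for one test location.

    `normal` and `abnormal` are discrete pdfs over 0..40 dB.  The
    normal pdf is shifted so its mode sits at the initial estimate,
    and a small pedestal gives every threshold nonzero probability.
    """
    shift = int(initial_estimate_db) - int(np.argmax(normal))
    normal = np.roll(normal, shift) + pedestal
    normal = normal / normal.sum()
    abnormal = abnormal / abnormal.sum()
    pdf = (4 * normal + 1 * abnormal) / 5.0
    return pdf / pdf.sum()
```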

After each presentation, the pdf is multiplied by a likelihood function centered on the presented stimulus intensity to produce a new pdf.^{11} The likelihood function used in our simulations is shown in Figure 2D. After the determination of the new pdf, its mean is calculated, and a stimulus of intensity equal to that mean is presented. The process is repeated until a termination criterion is met (in this case, a standard deviation of the pdf <1.5 dB). The output threshold is the mean of the final pdf.
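The ZEST cycle described above (present the pdf mean, multiply the pdf by the likelihood of the response, repeat until the pdf standard deviation falls below 1.5 dB, return the final mean) can be sketched as follows; the 0-40 dB domain, the cumulative-Gaussian likelihood with a 1.5-dB slope, and the presentation cap are stated in or assumed from the text:

```python
import numpy as np
from math import erf, sqrt

DOMAIN = np.arange(41, dtype=float)          # candidate thresholds, 0-40 dB

def p_seen(threshold_db, stim_db, slope=1.5):
    # cumulative Gaussian frequency-of-seeing curve (SD 1.5 dB)
    return 0.5 * (1 + erf((threshold_db - stim_db) / (slope * sqrt(2))))

def zest(prior, respond, sd_stop=1.5, max_pres=30):
    pdf = prior / prior.sum()
    for _ in range(max_pres):
        mean = (DOMAIN * pdf).sum()
        if sqrt((((DOMAIN - mean) ** 2) * pdf).sum()) < sd_stop:
            break                            # pdf narrow enough: stop
        stim = round(mean)                   # present the pdf mean
        like = np.array([p_seen(t, stim) for t in DOMAIN])
        pdf = pdf * (like if respond(stim) else 1 - like)  # Bayes update
        pdf = pdf / pdf.sum()
    return (DOMAIN * pdf).sum()              # threshold = final pdf mean
```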

The SITA approach to determining thresholds consists of four components^{4}:

- An algorithm for deriving an initial estimate of threshold at each location of the visual field, based on a “growth pattern.”
- An algorithm for determining the threshold at each location in the visual field, based on a hybrid staircase-QUEST procedure.
- A false-positive estimation technique based on response time.
- A postprocessing phase, in which the information from the third component is used to modify the results of the second.

As in SITA, our SQ implementation maintains two discrete probability functions (pfs).^{4} One pf gives the probability for each possible patient threshold, assuming that the location is abnormal, whereas the other maintains probabilities for thresholds that are normal. We begin with the same normal and abnormal pfs as in the ZEST procedure (Figs. 2A, 2B). Before the sequence of stimulus presentations begins for each location, the normal pf is translated along the threshold axis so that its mode aligns with the initial estimate for that particular location.

After each response, both pfs are updated, and an ERF is computed^{4} to determine whether the variance of either pf is sufficiently narrow to terminate the staircase procedure. Like the SITA developers, we tuned the ERF in our experiments to obtain the best performance from SQ; the experiments reported herein used an ERF of 0.70.

The number of presentations required in our simulations is lower than that reported for clinical implementations of the procedures^{4,5,6} because double determinations to estimate short-term fluctuation were not included. When simulated patients had no variability, SQ and ZEST had similar mean errors and standard deviations of error. FT, however, underestimated threshold by approximately 1.5 dB, probably largely because FT reports the last-seen stimulus as the estimate. With low-variability patients, both SQ and ZEST were slightly more accurate than FT (by approximately 1 dB), but the precision of the procedures (standard deviation of the error) was approximately equivalent. Both the mean error across the field and its standard deviation increased for all three procedures when patients had high variability.

Clinical comparisons have found that SITA returns threshold estimates approximately 1 dB higher than those of FT.^{5,14,16,18} Because FT returns the last-seen stimulus as the threshold estimate, a difference of 1 dB should be expected, irrespective of threshold. Artes et al.^{16} recently demonstrated that the differences between SITA and FT vary with threshold, being highest for intermediate sensitivities. It has also been argued that differences in threshold estimates between the strategies may arise in part from reduced fatigue during the shorter examinations produced by SITA.^{19} To explore this issue within our simulation model, the difference between SQ and FT is plotted as a function of threshold in Figure 4. These data were extracted from the visual field simulations. For most of the range of thresholds, SQ returned estimates of higher sensitivity than FT, and the magnitude of the difference was approximately 1 dB.

Previous evaluations of SITA have largely compared its threshold estimates with those returned by FT^{5,6,12,13,14,16}; however, the threshold estimate returned by FT can itself be inaccurate and imprecise.^{1,13,14,16} Such comparisons are essential if patients or clinical trials are to be moved from one test procedure to another, but they are of restricted utility in understanding the limitations of the procedures for accurately and precisely determining thresholds, knowledge that is essential for the detection of visual field loss and its progression.

SQ meets the published development goals of SITA,^{4,5,19} and so we assume that it is likely to be representative of the underlying principles of SITA. One further aspect of SITA that is not incorporated in SQ is that SITA alters pfs during the test based on the pfs of neighboring locations. The details of these alterations have not been published, and therefore we could not incorporate them in our SQ simulations.

Our finding that SQ returned slightly higher sensitivity estimates than FT is consistent with clinical comparisons of SITA and FT.^{5,14,16,18} It has been suggested that the difference between thresholds returned by SITA and FT may be caused in part by a reduction in fatigue in the shorter SITA examination.^{19} However, several studies have argued that factors other than fatigue are more likely to explain the difference.^{15,16,18} In addition, our simulation results suggest that the differences between SITA and FT estimates are unlikely to be due to differential effects of fatigue, but rather to the mechanics of the test algorithms: FT returned the last-seen presentation, whereas SQ/SITA returned the most likely mode of the two pfs used in the procedure. Because ZEST returned the mean of the final pdf, which provides a less biased estimate than the mode,^{8} ZEST returned a slightly different threshold again because of this factor alone. Inspection of Figures 4, 5, 6, and 7 reveals that the differences in error between SQ and FT varied with threshold, a finding that is broadly compatible with that of Artes et al.^{16}

Thresholds were pooled across locations to form the normal and abnormal pdfs,^{9} resulting in broader pdfs than if locations had been treated separately. Initial inspection of location-specific pdfs revealed that the shape of the abnormal pdf was highly aberrant at some locations because of sampling issues; hence the decision to pool across locations. The broader pdfs produced by pooling create a more uniform combined pdf, which increases the number of presentations required for ZEST to terminate while producing marginal improvements in accuracy and precision.

Although our pdfs were based on empiric thresholds, the specific derivation of pdfs for Bayesian test strategies is somewhat arbitrary.^{8,10} These pdfs may differ from those used in the commercial applications of ZEST on the Medmont perimeter and of SITA on the Humphrey Field Analyzer; however, they were based on a large number of empiric thresholds and so may be assumed to represent reasonably the underlying population distribution of thresholds.

The likelihood function used in these experiments was the discrete version of a cumulative Gaussian with a standard deviation of 1.5 dB.^{8,20} This slope is similar to that found for empiric frequency-of-seeing curves measured for SAP in normal observers.^{21} We also evaluated numerous other likelihood functions within the simulator and found that this function resulted in SQ terminating with an average number of presentations and a precision similar to those reported for SITA.^{4,5,6} We maintained the same likelihood function for ZEST to facilitate comparison between the mechanics of the procedures.^{20} We chose a dynamic termination criterion for ZEST to keep it similar to SQ. The parameters chosen for the pdf, likelihood function, and termination criterion may each be suboptimal; however, optimizing SQ and ZEST falls beyond the scope of this study.

Patient response variability is known to increase as sensitivity decreases.^{21,22,23} Hence, in a given patient, responses may range from no variability to high variability at different locations within the visual field. We present three variability conditions chosen to represent the end points of the range of response variability and patient response errors (no errors, and 30% false-positive and false-negative responses, a commonly used cutoff criterion for acceptable performance), as well as the middle of this range, and we assess performance at all possible stimulus levels for each of these conditions (Figs. 5, 6, 7). An alternative approach would have been to increase response variability with increasing deficit depth. Although that approach may more closely represent average clinical performance, the approach taken provides far greater information regarding the underlying performance of the three algorithms and their tolerance to variability, enabling assessment of the algorithms in situations that are uncommon but still occur at times (for example, locations at which threshold is normal but the subject’s responses have high variability). In practice, the results for any individual patient may be a hybrid of the three variability models presented and can be determined from the data shown in Figures 5, 6, and 7. It is also possible that our choice of equal numbers of false-positive and false-negative responses is not representative of typical performance. Indeed, typical patients may have either 15% false-positive or 15% false-negative responses, but not both. A significant response bias in one direction only (for example, false positives) can be expected to introduce a more severe systematic error than that shown in our low-variability group, but may reduce the standard deviation of the error.
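A simulated observer with the false-positive and false-negative rates described above might look like this (a sketch: the deterministic underlying threshold rule matches the no-variability model, and applying the two error rates independently per presentation is our assumption):

```python
import random

def noisy_respond(stim_db, true_db, fp=0.15, fn=0.15, rng=random):
    """Observer response with false-positive/false-negative errors.

    A stimulus the observer would genuinely see is missed with
    probability `fn`; one that would not be seen draws a spurious
    "yes" with probability `fp`.
    """
    would_see = (stim_db < true_db
                 or (stim_db == true_db and rng.random() < 0.5))
    if would_see:
        return rng.random() >= fn        # false negative with prob fn
    return rng.random() < fp             # false positive with prob fp
```

Setting `fp = fn = 0.30` reproduces the high-variability condition, `0.15` the intermediate condition, and `0.0` the error-free observer.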

**Figure 1.**

**Figure 2.**

**Figure 3.**

**Figure 4.**

**Figure 5.**

**Figure 6.**

**Figure 7.**