purpose. To describe and evaluate new statistical techniques for detecting topographic changes in series of retinal and optic nerve head images acquired by scanning laser tomography (Heidelberg Retinal Tomograph [HRT]; Heidelberg Engineering, Heidelberg, Germany).

methods. Proven quantitative techniques, collectively referred to as *statistic image mapping* (SIM), are widely used in neuroimaging. These techniques are applied to HRT images. A pixel-by-pixel analysis of topographic height over time yields a statistic image that is generated by using permutation testing, derives significance limits for change wholly from the patient’s own data, and removes the need for reference data sets. These novel techniques were compared to the Topographic Change Analysis (TCA super-pixel analysis) available in the current HRT software, by means of an extensive series of computer experiments. The SIM and TCA techniques were further tested and compared to linear regression of rim area (RA) against time, in real longitudinal HRT series of eyes of 20 normal subjects and 30 ocular hypertensive (OHT) patients that were known to have converted to glaucoma, on the basis of visual field criteria.

results. Computer simulation indicated that SIM has better diagnostic precision at detecting change. In the real longitudinal series, SIM flagged false-positive structural progression in two (10%) of normal subjects, whereas TCA identified three (15%), and linear regression of RA against time identified two (10%). SIM identified 22 (73%) of the OHT converters as having structural progression, whereas the TCA and linear regression of RA against time each identified 16 (53%) over the course of the follow-up.

conclusions. SIM has better diagnostic precision in detecting change in series of HRT images when compared to current quantitative techniques. The clinical utility of these techniques will be established on further longitudinal data sets.

^{ 1 }

^{ 2 }and typified by the commercially available Heidelberg Retinal Tomograph (HRT; Heidelberg Engineering, Heidelberg, Germany) is widely used in the assessment of the glaucomatous optic nerve head (ONH). Quantitative assessment of these topographic images can separate glaucomatous and normal eyes with generally high levels of diagnostic precision,

^{ 3 }

^{ 4 }

^{ 5 }

^{ 6 }

^{ 7 }with some evidence that this can be done in cases without measurable standard perimetric defects at the time of testing.

^{ 8 }

^{ 9 }The real promise of this technology probably lies in objectively measuring progressive structural damage, or stability, in patients being observed over time, which is possible, because the local height measurements at each of the pixels of a topography image are sufficiently reproducible.

^{ 10 }

^{ 11 }To date, however, analyses of these images have focused on measures derived from the arbitrarily defined disc contour and either applied globally to the whole disc or to predefined segments. One method of monitoring progression using the HRT software is to determine whether differences in summary features of the optic nerve head (stereometric parameters), separated by a specified amount of time, exceed the limits of variability for repeated imaging.

^{ 12 }

^{ 13 }

^{ 14 }

^{ 15 }

^{ 16 }

^{ 17 }considers change over time at the level of groups of pixels within the image: the Topographic Change Analysis (TCA). Now included in the HRT software, this technique divides the image into a 64 × 64-superpixel array. (Each superpixel is 4 × 4 pixels, thus containing 16 pixels.) Change in topographic height in superpixels is quantified with a standard statistical method comparing a set of baseline images to the most recent follow-up images.

^{ 16 }

^{ 17 }

^{ 18 }We propose to exploit part of this catalog of proven methods, specifically the techniques collectively referred to as

*statistic image mapping*(SIM) which are used for determining areas of activity and change in series of MRI- and PET-type images, by applying them to series of retinal and optic nerve head topography images. In particular, we use a nonparametric version of these techniques. These are intuitive to understand, and assessment of change in the image is based solely on the subject’s own data and within-subject image variability, rather than any a priori information or patient population characteristics.

^{ 1 }

^{ 2 }Briefly, the HRT uses a low-intensity diode laser and obtains 32 equally spaced confocal sections, centered on the optic disc and perpendicular to the optical axis of the eye. The 32 sections, each having an area of 256 × 256 pixels, are aligned to compensate for lateral eye movements during acquisition. A 3-D reconstruction of the image area is obtained by calculating the positions of maximum reflectivity along the

*z*-axis, providing an image with discrete topographic height values at 65,536 (256 × 256) pixels. Typically, at each clinical visit, multiple scans are obtained in a subject, usually three.

^{ 19 }A known characteristic of patients with progressing glaucoma is increasing ONH excavation and nerve fiber layer thinning with time, often referred to as structural progression.

^{ 20 }

^{ 21 }

^{ 22 }The ideal clinical tool for assessing a longitudinal set of these HRT images would highlight this structural progression as localized areas of the ONH that are changing beyond the natural within-test and between-test variation in the images.

*statistic image map*. The entire statistic image must be assessed for significant effects, using a method that accounts for the inherent multiplicity involved in testing all the pixels at once. This analysis can be accomplished in a classic parametric statistical framework,

^{ 23 }

^{ 24 }but we used an alternative that is based on

*permutation testing*. The latter is conceptually simple, does not rely on any theoretical probability models that may or may not be appropriate for the HRT data, deals with the multiple comparison problem of testing a vast image space, and critically derives significance limits for change based only on the individual patient’s series of data. These specific techniques and the mathematics that underpin them, as applied to PET and MRI data, are extensively described elsewhere.

^{ 25 }

^{ 26 }

^{ 27 }

^{ 28 }

^{ 29 }

^{ 30 }

^{ 31 }What follows is a description of three component parts of this approach and how we applied them to a series of HRT data, with the descriptive order chosen to facilitate the explanation of the methods, rather than to replicate the computational sequence, details of which can be found in the

^{Appendix}.

*statistic image*—no longer a physiological image, but a 256 × 256-pixel map of statistics summarizing change within the image. The next step is to determine whether the observed test statistic at each pixel is unusual, or more extreme, than would be expected by chance. This testing of the

*significance*of the test statistic is not completed in the conventional manner, by considering the observed test statistic as a random variable from a probability model, but uses a

*permutation test.*We randomly shuffle, or relabel, the order of the observed data and recalculate the test statistic for all possible permutations of the order of images. If we let

*N*denote the number of all possible labelings,

*t*

_{ i }the statistic corresponding to labeling

*i*, then the set of

*t*

_{ i }for all possible relabeling constitutes the

*permutation distribution.*For example, there would be 369,600 [12!/(3! × 3! × 3! × 3!)] of these in a series of four clinical visits with three scans at each visit (see

^{Appendix}for more details of this calculation). We then assume that all the

*t*

_{ i }are equally likely and determine the significance of the observed test statistic by counting the proportion of the permutation distribution as, or more, extreme than our observed value, giving us our

*P*-value. If

*P*is, for example, <5% we label the pixel as active or changing. (We therefore assume that images acquired at the same visit are no more correlated than images acquired between visits. Previous work on the influence of time separation on interimage topographic variability support the intuition behind this approach.

^{ 32 }) This

*permutation test*is performed pixel by pixel, and the statistic image becomes “thresholded” at the 5% level, with pixels flagged if they are significant (Fig. 1) . In practice, a sample of 1000 randomizations (drawn without replacement from all the possible labelings) are used to generate the permutations distribution.

^{ 33 }

^{ 34 }This eases the computation burden but still allows for a statistically exact test at standard levels of significance testing. (Larger samples would be needed to evaluate

*P*< 0.1%.)

*multiple comparisons problem*and the construction of a corrective analysis for high-dimensional MRI and PET data has occupied many researchers, with ideas ranging from the simple use of Bonferroni adjustments to other mathematical solutions (see Nicholls and Holmes

^{ 29 }). In this work we again exploited an intuitive approach, once more using a permutation test, which has been successfully applied to sequences of MRI and PET images, and outperforms other approaches when there are few images involved (or experiments with low degrees of freedom). Once we had thresholded the statistic image pixel by pixel (Fig. 2) , we were left with an image that contained clusters of contiguous, significant, or active pixels. We then noted the size of the largest cluster in the observed image. To ascertain whether the spatial extent of the clusters in the observed image was unusually large by chance alone, we set about shuffling the images again, recomputed the statistic image, calculated the cluster sizes, and recorded the size of the largest cluster. (In fact, the shuffling for the pixel-by-pixel analysis and the cluster testing is all accomplished in one “sweep” in the computational algorithm). This procedure was repeated, to generate a permutation distribution of the maximum cluster size (Fig. 2) . Hence, we assessed the significance of the observed result by considering only the patient’s data, and no knowledge of the probabilistic behavior of the topographic heights at image pixels was required. This is particularly useful because of the

*spatial correlation*that exists within the image (i.e., the topographic height of neighboring pixels is more similar in some parts of the image than in others) and, in part, this cluster testing accounts for this. The threshold value generated to determine progression was unique for each patient and varied depending on the patient’s signal-to-noise ratio. The criteria for progression included only depressed clusters (a continuous set of active pixels with negative slopes) bound within the contour line for the optic disc.

*pseudo test statistic*. Rather than divide the 256 × 256 matrix of individual slope values by the 256 × 256 matrix of individual SEs to yield the test statistic, the slope values are divided by a spatially filtered SE. The latter is the matrix of SEs smoothed with a weighted Gaussian kernel. Thus, a pseudo-test-statistic image is formed by dividing the slope matrix by the smoothed SE matrix. Hence, all the analyses, including the permutation cluster testing, proceeded with these pseudo test statistics. In essence, the noise from the variance image (the matrix of SEs) is smoothed, but not the signal. Statistic image maps constructed with smoothed variance estimates have been shown to improve the power of the approach substantially and can only be used in the nonparametric or permutation setting outlined herein.

^{ 26 }

^{ 27 }

^{ 28 }

^{ 29 }We therefore included this in our approach.

^{ 17 }Any virtual patient who showed a cluster of 20 or more significant superpixels bound within the contour line for the optic disc, where the topographic change compared with baseline occurred in three consecutive sets of follow up images, was considered to have confirmed progression of glaucoma.

^{ 35 }

^{ 36 }In this study, simulation experiments were designed to quantify the specificity and sensitivity of our technique and the TCA superpixel method. Subjects with stable or unstable images were simulated. Those with unstable images had gradual and episodic change applied to a region of the neuroretinal rim. Each subject comprised a longitudinal series of 30 images: 10 sets of 3 images (baseline set, and nine follow-up visits).

*x*′,

*y*′, and

*z*′) and rotations about each axis (σ

_{ x }′, σ

_{ y }′, and σ

_{ z }′). Transformations are applied at a subpixel level, using bicubic interpolation algorithms.

^{ 37 }The magnitude of each of the six transformations was made unique in each simulation by using a random number sampled from a normal distribution, wherein the mean of the size of the transformation is set at zero and a variance is fixed for each transformation. To mimic background noise, Gaussian noise was added to each pixel with variance

*v*and mean zero. (A proven nonbiased random-number generator was used to sample from a normal distribution.

^{ 38 }) To replicate the repeatability of topographic height measurements in clinical data, groups of virtual subjects were simulated having a mean pixel height standard deviation (MPHSD) of 15, 25, or 35 μm. The result of applying movement noise mimicked the between-image variability illustrated in previous studies, with higher variation in areas of high gradient change, such as across blood vessels and in the cup.

^{ 39 }

^{ 40 }Each simulated series was stored to computer disc, allowing the specificity and sensitivity of both techniques to be evaluated on identical image series.

^{Appendix}, recording for each patient series the visit at which (false-positive) change was first detected. We then applied the TCA method to the same data set, again recording for each patient series the visit at which (false-positive) change was first detected.

^{ 12 }

^{ 13 }In short, OHT patients had an intraocular pressure (IOP) of ≥22 mm Hg on two or more occasions, two initial reliable visual field results with AGIS (Advanced Glaucoma Intervention Study) score of 0, absence of other significant ocular disease that would affect visual field performance, and age >35 years. The eligibility criteria for the normal subjects included IOP consistently <22 mm Hg, baseline reliable visual field results with an AGIS score of 0, no significant ocular disease, no family history of glaucoma, and age >35 years. A reliable visual field was defined as <25% fixation errors, <30% false-positive errors and <30% false-negative errors. The normal subjects were followed up concurrently with the OHT patients. The study was in compliance with the guidelines established in the Declaration of Helsinki.

*P*< 0.05).

*P*< 0.001). In case 2, SIM identified progression at visit 6, whereas the TCA did not detect progression at all, the linear regression of RA against time was reached slight statistical significance at visit 7 (

*P*= 0.042).

^{ 14 }

^{ 15 }

^{ 41 }These methods may be subject to similar inadequacies associated with using the global indices to summarize progression in visual fields: chiefly loss of spatial information and poor sensitivity to identify the localized damage.

^{ 42 }

^{ 43 }In this work, we have presented and evaluated statistical procedures for the analysis of pixel level longitudinal data in structural HRT images by using existing techniques primarily developed for neuroimaging data.

^{ 16 }but series of confirmation tests and a requirement for a certain cluster size were needed to produce similarly adequate levels of diagnostic precision in real longitudinal image data sets.

^{ 17 }This necessity is not surprising, as the original simulations centered on a single superpixel rather than results across the whole image. A statistical adjustment (the Satterthwaite correction) was used to correct for similarity (spatial correlation) of the topographic height

*within*a superpixel, but no real account was made for the multiplicity of testing across the whole image. The empiric solution to the problem of multiplicity of testing included the requirement for clusters of pixels to be above a certain size, based on observed series of normal subjects.

^{ 17 }However, this empiric solution is based on observations of variability within a population and not on observed variability within a subject’s own data series. One possible reason our technique outperforms this analysis in our computer simulation, and in real longitudinal data, is that it inherently corrects for the multiple comparison problem. Handling this aspect of imaging data is one of the key features of the SIM approach.

^{ 33 }With the increased computational power now available, there seems to be no important argument against their preferred use in situations in which there may be arbitrary properties of the observed data that cannot be accounted for by a probability model. The most plausible explanation, however, for the better diagnostic precision of SIM in comparison to the TCA technique in these computer experiments is simply the use of the whole series of the data. The TCA method uses only the baseline images and three follow-up images. This may be reasonable when the follow-up is short, but when the available series of data lengthens beyond four visits, it results in considerable data redundancy, as illustrated in Figure 3when the difference between the two methods appears approximately halfway through the potential follow-up of 10 visits. It is also interesting to note that there is no discernible difference between the power of the methods when episodic or sudden loss is specified (Fig. 3) . This aspect of the results is reassuring, because our choice of the pixel-by-pixel test statistic is essentially a rate (trend) parameter, which might not be considered sensitive to detecting a sudden change. However, we have reported for threshold measurements in the visual field that linear regression adequately identifies sudden change, unless a series of data becomes very long.

^{ 44 }At the same time, there is a real advantage of using a rate parameter, as it may provide clinically interpretable information once the technique has identified a significant region of change. Of course, there is no firm evidence about structural loss in glaucoma being either gradual or sudden, but it seems the new technique described herein will be sufficiently sensitive to both types of deterioration.

^{ 45 }Furthermore, these techniques should not be confined to one type of retinal imaging modality and could be applied, for example, to series of images acquired from optical coherence tomography and scanning laser polarimetry.

^{ 33 }

^{ 34 }These sample data ease the computation burden but still allow for a statistically exact result at standard levels of significance testing. (Larger samples would be needed to evaluate

*P*< 0.01.)

*s*is the number of scans per visit and

*n*is the number of visits. For example, with four visits and three scans per visit, there are

- At each
*pixel*(*i*,*j*) calculate by least-squares linear regression the slope*b*(*i*,*j*), SE*se*(*i*,*j*), and absolute test statistic*t*(*i*,*j*) of time (dependent variable) against topographic height (independent variable). - Shuffle the order of the dependent variable (time) to generate a unique permutation and recalculate
*b*,*se*, and*t*. - Repeat step 2 1000 times, calculating a unique permutation each time. As each permutation must be unique, the algorithm must perform sampling without replacement.
- We reject the null hypothesis at a significance level of
*P*< 0.05. Thus, for the mechanics of the permutation distribution, we reject the null hypothesis if the observed test statistic is greater than or equal to the 95th percentile of the permutation distribution. Therefore, sort the array of test statistic*t*produced at each*pixel*(*i*,*j*) in ascending order, and test whether the absolute observed test statistic is equal to or more than the 950th (0.95 × 1000) value of*t*. Note that we retain the sign of the observed test statistic to indicate the direction of change—that is, a negative sign indicates a depression in topographic height values over time, whereas a positive sign indicates an elevation in topographic height over time.

*tstat*(

*i*,

*j*) is calculated by dividing slope

*b*(

*i*,

*j*) with a smoothed SE

*se*(

*i,j*). The smoothed SE is calculated by convolving the SE

*se(i*,

*j*) with a Gaussian kernel. We used a square Gaussian kernel of symmetrical full width at half maximum of 11 and size 17 × 17 to smooth the SE

*se*(

*i*,

*j*). The pseudo test statistic is calculated for the observed case and for each unique permutation.

- Compute the observed pseudo test statistic.
- Compute the pseudo test statistic for each unique permutation.
- Compute an observed statistical image
*s*(*i*,*j*) by setting*s*(*i*,*j*) to equal*active*_*depressed*or*active*_*elevated*, if the observed absolute pseudo test statistic is within or higher than the 95th percentile of the permutation distribution of the absolute pseudo test statistic at*pixel*(*i*,*j*). Record the size of the maximum depressed and elevated clusters within the observed statistic image, bound within the contour line. An active pixel within a statistical image*s*(*i*,*j*) is defined as part of a continuous cluster if one of the eight pixels within its neighborhood is also active (i.e., 8-connectivity). - Compute a statistic image at each of the 1000 unique permutations. Record the size of the maximum depressed and elevated clusters for each unique permutation, bound within the contour line.
- Sort the array of maximum clusters into ascending order.
- A depressed or elevated cluster (or clusters) within the observed statistic image is defined as statistically significant if it (or they) are larger than the 99th percentile of the maximum depressed and elevated cluster distributions. Progression is defined if a depressed cluster is larger than the 99th percentile of the maximum depressed distribution.

**Figure 1.**

**Figure 1.**

**Figure 2.**

**Figure 2.**

**Figure 3.**

**Figure 3.**

SIM | TCA | RA | |||||||
---|---|---|---|---|---|---|---|---|---|

n | (%) | n | (%) | n | (%) | ||||

Control subjects | 2 | (10) | 3 | (15) | 2 | (10) | |||

Converters | 22 | (73) | 16 | (53) | 16 | (53) |

**Figure 4.**

**Figure 4.**