Abstract
Purpose.
The goals of this study were to investigate the effectiveness of computerized repeating and averaging of visual acuity measurements in reducing test-retest variability (TRV) and to estimate the increase in sensitivity and specificity that would be achieved in diagnosing visual acuity change.
Methods.
Timed, paired ETDRS chart and computerized acuity mean measurement (CAMM) were performed in 100 subjects. CAMM(n) scores were the running mean of consecutive measurements. Bland-Altman methods were used to calculate 95% ranges for TRV.
Results.
The 95% TRV ranges of ETDRS measurements and of the CAMM score after 6 measurements (CAMM6) were, respectively, ±8 and ±5.7 ETDRS letters (P = 0.02). CAMM6 offered a pragmatically optimum tradeoff between reduced TRV and test time. A measured change of 5 letters or more in the absence of true change was observed in 13% (95% CI, 8%–21%) with the ETDRS chart and 4% (95% CI, 2%–10%) with CAMM6 measurements. To achieve ≥95% test sensitivity (assuming 95% test specificity), change criteria of 15 and 11 letters must be set with an ETDRS chart and CAMM6, respectively. CAMM6 measurement times were longer than those for the ETDRS chart (mean 234 seconds vs. 74 seconds).
Conclusions.
Compared with the current gold standard, computerized repeating and averaging of acuity measurements improve specificity and sensitivity when identifying true changes. The 160-second increase in test time should be set against the considerable economic and clinical benefits that may result.
Detection of true change in clinical status is a fundamental aim of visual acuity testing. However, the error inherent in even gold standard visual acuity measurements is such that changes of 2 logMAR lines may be observed when no actual change has occurred.
1–10 This measurement error may result in a failure to detect a true change in visual acuity (a false-negative result) or a diagnosis of change when none has occurred (a false-positive result).
Various change criteria (that is, predefined cutoffs for measured changes in visual acuity) have been proposed for use in clinical practice without reference to the underlying sensitivity and specificity of acuity tests for detecting change.
11,12 The use of such change criteria can have important clinical and economic implications. For example, a 5 ETDRS letter deterioration in acuity was a retreatment criterion in the PrONTO study of ranibizumab for neovascular age-related macular degeneration (nARMD). A deterioration in acuity of greater than five ETDRS letters is listed on the European Medicines Agency (EMEA) product label as one of the criteria for retreatment of nARMD with intravitreal ranibizumab.
13,14 However, 11% of subjects with stable nARMD have been reported to show a reduction in visual acuity of five or more ETDRS letters on consecutive visits because of test-retest variability (TRV) alone.
15
In addition to, and as a consequence of, false-positive results and the resultant need to consider only relatively high levels of measured acuity change as diagnostic of change, false-negative results are also a serious problem. For example, using an ETDRS chart, Rosser et al.
16 reported a sensitivity of only 38% to detect a worsening of five ETDRS letters in a population in whom a true acuity change of that magnitude had been simulated. The majority (62%) of persons experiencing that degree of deterioration were missed.
The current gold standard visual acuity test involves measurement of a single logMAR acuity score using single-letter scoring. Repeating and averaging is a standard approach to reducing measurement error, and it has been suggested that this process might usefully be applied to acuity measurements.
17 This is clinically unfeasible using standard hard copy charts. The development of computerized visual acuity measurement systems, including COMPlog
18,19 and the E-ETDRS,
4,20,21 may, however, make this a valid clinical option.
Thus, the aims of the study reported here were to investigate the effectiveness of computerized repeating and averaging of visual acuity measurements in reducing TRV and to estimate the resultant increase in sensitivity and specificity in the diagnosis of visual acuity change.
Approval for this study was obtained from the Research Ethics Committees of Guy's and St. Thomas' Hospital and the London School of Hygiene and Tropical Medicine (London, United Kingdom). The research followed the tenets of the Declaration of Helsinki, and informed consent was obtained from subjects. Consenting adults attending ophthalmology clinics were recruited. Each participant was tested twice using ETDRS charts and twice using the COMPlog visual acuity testing device (i.e., four tests in total) in the same session. Tests were performed in random order on the participant's poorer functioning eye with the participant wearing his or her habitual refractive correction. The randomization sequence was generated using random sampling and random assignment software (Research Randomizer, version 3.0 [Urbaniak GC, Plous S];
http://www.randomizer.org/). All tests were carried out by 1 of 3 examiners under the same illumination conditions, with the participant's fellow eye occluded. Examiners were trained in both measurement techniques before the study began.
One pair of tests involved single letter scoring and timed acuity measurements using ETDRS charts.
22 Charts 1 and 2 were used and displayed in a standard light box (Low Vision Products; Lighthouse International, New York, NY).
23 The end point for all tests was defined as an entire line of letters being misread, with the total number of correctly identified letters forming the basis of the letter score. The charts were read from a distance of 4 meters, with scoring starting from a score of 30 ETDRS letters unless the subject misnamed any letters on the top line. In this event, the subject was moved to 1 meter, and the measurements were taken at this distance with scoring starting at 0 letters.
24
The other pair of tests involved computerized measurement of visual acuity using the COMPlog acuity measurement system. This system has been described in detail elsewhere,
18,19 as has its performance compared with gold standard ETDRS chart single-letter scoring acuity measurements and computerized measurements taken using the E-ETDRS protocol.
4,20 In brief, the system consists of a software program running on a personal computer that drives a standard primary monitor and a 20.1-inch, 1600 × 1200 resolution LCD flat panel secondary monitor on which the test optotypes are presented. From a test distance of 3 meters, this configuration allows presentation of crowded optotypes sized between 1 ETDRS letter (1.67 logMAR, 20/928 US Snellen) and at least 95 ETDRS letters (−0.2 logMAR, 20/13 US Snellen). (The effects of pixillation in truncating measurements of greater than 95 ETDRS chart letters are being investigated). The examiner controls the test through a series of sequential control screens presented on the primary monitor: these enable collection of demographic data, control of the testing algorithm, response recording, and presentation of automatically calculated test scores.
The COMPlog testing algorithm consists of two phases, range finding and thresholding, each of which requires forced choice from the participant and input by the examiner about whether each letter is correct or incorrect. Range finding involves the presentation of sequentially smaller, single, crowded Sloan letters starting from 45 ETDRS letters (0.8 logMAR, 20/125 US Snellen) with a step size of 2 logMAR lines (10 letters). The range-finding algorithm presents increasingly larger letters in 10 ETDRS letter steps if the 45 ETDRS letter-sized optotype is incorrectly identified. Failure to correctly identify a 1 to 5 ETDRS letter score optotype results in an invitation to classify acuity using a decremental scale of counting fingers, hand movements, perception of light, and no perception of light.
In this study, range finding was followed by a series of 8 separate, sequential acuity measurement tests. A running mean acuity was calculated after each of these tests—this is referred to as the Computerized Acuity Mean Measurement (CAMM) score—with a suffixed number to identify the number of tests on which that mean was based. The CAMM3 score is therefore the average of the first 3 acuity thresholding measurements, and the CAMM8 is the average of the entire series of 8 measurements.
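The running mean described above can be illustrated with a minimal sketch. The function name `camm_scores` and the example threshold scores are hypothetical and are not taken from the COMPlog software.

```python
from itertools import accumulate


def camm_scores(measurements):
    """Return [CAMM1, CAMM2, ...]: the running mean of consecutive
    acuity threshold scores (in ETDRS letters)."""
    running_totals = accumulate(measurements)
    return [total / n for n, total in enumerate(running_totals, start=1)]


# Hypothetical series of 8 threshold scores for one eye (ETDRS letters).
scores = [66.7, 70.0, 68.3, 65.0, 71.7, 68.3, 66.7, 70.0]
camm = camm_scores(scores)

camm3 = camm[2]  # mean of the first 3 measurements
camm6 = camm[5]  # mean of the first 6 measurements
camm8 = camm[7]  # mean of the entire series of 8
```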
Each of the individual acuity measurement tests involved presentation of 3 Sloan letters at each line size with a between-line step size of 0.1 logMAR (1 ETDRS chart line). Letters were selected at random, spaced half a letter width apart, and surrounded at the same separation by a crowding box of 1 stroke width. Where letter size prevented simultaneous presentation of all 3 letters on a line, the line was broken up into 2- and 1-letter or 3 single-letter groups. No letters were repeated on any line. Subsequent thresholding repeats started 2 logMAR lines larger than the threshold of the previous test. The scoring algorithm assumes that no errors would have been made on lines larger than the first line, on which all letters are correctly identified.
The principle of full interpolation of visual acuity test scores is based on each letter being awarded a fraction of a full line score based on the number of letters presented per line and the line size increment.
8,25 There are 5 ETDRS chart letters per line, as a consequence of which each letter is awarded a score of one-fifth of a line. If a logMAR score is used, the line size increment is 0.1 logMAR, and each letter consequently receives a score of 0.02 logMAR. In this study we used number of ETDRS letters scoring, in which a line is awarded a score of 5 letters and each individual letter is awarded a score of 1. The COMPlog thresholding algorithm involved presentation of 3 letters per line and a line size increment of 0.1 logMAR. Each letter in our 3-letters-per-line test was accordingly awarded a score of 5/3 ETDRS letters (1.67 ETDRS letters). CAMMn scores are also presented in number of ETDRS letters format and are rounded up to the nearest appropriate single decimal place. A CAMM6 score involves the presentation of 18 letters at each line size (3 letters per line repeated 6 times), and each letter might therefore be awarded a score value of 5/18 ETDRS letters (0.28 ETDRS letters). As a result of this, the CAMM6 score was rounded up to n.0, n.3, or n.7 ETDRS letters.
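The interpolation arithmetic above can be checked with a short sketch. The helper `letter_value` is a hypothetical name introduced for illustration; it only encodes the rule that one 0.1 logMAR line is worth 5 ETDRS letters, shared equally among the letters presented at that line size.

```python
from fractions import Fraction

LETTERS_PER_ETDRS_LINE = 5   # one 0.1 logMAR line is worth 5 ETDRS letters
LOGMAR_PER_LINE = 0.1


def letter_value(letters_presented_per_line):
    """ETDRS-letter credit for each correctly read letter when a line
    is made up of the given number of presented letters."""
    return Fraction(LETTERS_PER_ETDRS_LINE, letters_presented_per_line)


etdrs_chart = letter_value(5)      # 5 letters per line on the chart
single_test = letter_value(3)      # 3 letters per line in one COMPlog test
camm6_step = letter_value(3 * 6)   # 18 letters per line size over 6 repeats

# logMAR credit per letter on a 5-letter chart line
logmar_per_chart_letter = LOGMAR_PER_LINE / LETTERS_PER_ETDRS_LINE
```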
The secondary monitor used had a screen luminance of 236 cd/m2, and the contrast of letters was measured to be 99.8%. The ETDRS light box was measured to have a luminance of 111 cd/m2.
One hundred participants aged 17 to 83 years were recruited (mean age, 55.2 years). Fifty-eight were male, and the right eye was tested in 53 subjects. Twelve subjects had no ophthalmic pathology, 63 had a primary diagnosis of surgical or medical retinal disorders, 15 had glaucoma, and 6 and 4, respectively, had cataract and uveitis as the primary pathology. The median first measured ETDRS chart visual acuity of tested eyes was 68 ETDRS letters (US Snellen equivalent 20/42) (range, 24–97 ETDRS letters; US Snellen equivalents 20/12 to 20/330). The widths of the observed test-retest ranges were similar to the estimated ranges based on the assumption of normality (data not shown).
The distribution of test-retest differences appeared approximately normal for both tests (data not shown).
Figures 1 and 2 present Bland-Altman plots for the ETDRS chart measurements and CAMM6 test results, respectively. Visual inspection did not suggest that TRV was associated with the underlying acuity for either the ETDRS chart measurements or the CAMM6 test results. The mean difference between ETDRS chart scores and CAMM6 scores was −0.02 logMAR (95% CI, −0.04 to 0.00). There was thus no evidence that CAMM scores are systematically either higher or lower than ETDRS scores.
A scatter plot of TRV for ETDRS and CAMM6 scores against patient age is presented in Figure 3. The respective correlation coefficients for these associations were nonsignificant (r = 0.08 [P = 0.43] and r = −0.07 [P = 0.48] for ETDRS and CAMM6, respectively).
The SD of test-retest differences for ETDRS chart measurements was 4 letters (Table 1). For CAMM scores, the SD of test-retest differences fell with increasing numbers of repeats from 5.1 (CAMM1) to 2.9 (CAMM8); the incremental reduction in the estimated 95% TRV range is shown in Table 1. The widths of the observed test-retest range were similar to the estimated range based on the assumption of normality (Figs. 1, 2). Given the very limited reduction in the SD of test-retest differences seen beyond 6 repeats, CAMM6 scores were chosen for further analysis; this was based on a pragmatic tradeoff between TRV and test time (Table 1). The absolute difference in TRV was greater for ETDRS chart measurements than for CAMM6 scores in 59 subjects; the reverse was found in 36 subjects, and no between-method difference in TRV was found in 5. The sign test P value of 0.02 suggests that the observed TRV of CAMM6 scores was significantly less than that found for ETDRS chart measurements in our population of 100 subjects.
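The sign test comparison can be sketched as an exact two-sided binomial test on the 95 untied subjects (59 vs. 36), with the 5 ties discarded. This is an illustrative reconstruction: the text does not state which variant of the sign test or which software was used, and `sign_test_two_sided` is a hypothetical helper.

```python
from math import comb


def sign_test_two_sided(n_positive, n_negative):
    """Exact two-sided sign test P value under H0: P(positive) = 0.5.
    Ties are assumed to have been dropped before calling."""
    n = n_positive + n_negative
    k = max(n_positive, n_negative)
    # Probability of a split at least as extreme as k out of n in one tail
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# 59 subjects with larger ETDRS variability, 36 with larger CAMM6 variability
p_value = sign_test_two_sided(59, 36)
print(f"sign test P = {p_value:.3f}")
```

The printed value can be compared with the reported P = 0.02.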
Table 1. Test-Retest Variability and Test Times
Acuity Test Pair | Obs | Mean Difference | SD* | Minimum | Maximum | 95% Range of TRV† | Empirical 96% Test-Retest Range: 3rd Percentile | Empirical 96% Test-Retest Range: 98th Percentile | Test Time: Mean/Median (s) | Test Time: Range (s) |
ETDRS Chart | 100 | 0.03 | 3.98 | −9.0 | 10.0 | ±8 | −8.0 | 8.0 | 74/63 | 21–207 |
CAMM1 | 100 | 0.90 | 5.17 | −13.5 | 18.5 | ±11.0 | −8.5 | 10.0 | 56.2/47 | 22–186 |
CAMM2 | 100 | 1.03 | 4.03 | −9.0 | 16.5 | ±7.9 | −6.0 | 10.0 | 91.2/78 | 37–296 |
CAMM3 | 100 | 0.85 | 3.62 | −8.5 | 17.0 | ±7.1 | −6.0 | 7.0 | 127.0/108.5 | 52–388 |
CAMM4 | 100 | 0.58 | 3.28 | −9.0 | 16.0 | ±6.4 | −6.0 | 7.5 | 162.7/141 | 67–517 |
CAMM5 | 100 | 0.36 | 3.02 | −7.5 | 13.0 | ±5.9 | −6.5 | 6.0 | 197.7/168 | 80–637 |
CAMM6 | 100 | 0.34 | 2.92 | −8.0 | 11.0 | ±5.7 | −6.5 | 5.5 | 233.8/198 | 93–865 |
CAMM7 | 100 | 0.34 | 2.89 | −8.0 | 9.0 | ±5.7 | −6.5 | 5.0 | 267.1/224.5 | 103–995 |
CAMM8 | 100 | 0.28 | 2.85 | −9.5 | 9.0 | ±5.6 | −6.5 | 5.0 | 301.6/254.5 | 117–1052 |
ETDRS chart measurements in this study took a mean of 74 seconds and a median of 63 seconds (range, 21–207 seconds). CAMM6 measurements took a mean of 234 seconds and median of 198 seconds (range, 93–865 seconds;
Table 1).
The predicted specificities of ETDRS chart and CAMM6 measurements for different change criteria are presented in Table 2. It can be seen that a change criterion of 6 letters has a specificity of 95% or better with CAMM6 but a specificity of only 87% for ETDRS. Thus, using a change criterion of 6 (for example) would result in more than twice as many false positives with ETDRS as with CAMM6. The corresponding specificities for a change criterion of 5 letters are 79% and 91%. If broader change criteria are used, then specificity increases with both tests such that it is at least 96% for change criteria of 8 or more letters.
Table 2. Estimated Test Specificity (Assuming True Change Is Zero) for Differing Change Criteria
Change Criterion (no. of ETDRS letters) | ETDRS Chart Specificity (%) | CAMM6 Score Specificity (%) |
5 | 79 | 91 |
6 | 87 | 96 |
7 | 92 | 98 |
8 | 96 | 99 |
9 | 98 | 100 |
10 | 99 | 100 |
11 | 99 | 100 |
12 | 100 | 100 |
13 | 100 | 100 |
14 | 100 | 100 |
15 | 100 | 100 |
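A minimal sketch of how specificities of this kind can be estimated follows. It assumes that test-retest differences in stable eyes are normally distributed with zero mean and the SDs given in Table 1 (3.98 letters for the ETDRS chart, 2.92 for CAMM6), and that a change is flagged in either direction. These modeling choices and the function name are assumptions made for illustration, not a statement of the exact calculation behind Table 2.

```python
from statistics import NormalDist

SD_ETDRS = 3.98  # SD of test-retest differences, ETDRS chart (Table 1)
SD_CAMM6 = 2.92  # SD of test-retest differences, CAMM6 (Table 1)


def specificity(change_criterion, sd_diff):
    """P(|measured change| < criterion) when the true change is zero,
    assuming differences ~ Normal(0, sd_diff)."""
    z = change_criterion / sd_diff
    return 2 * NormalDist().cdf(z) - 1


for criterion in range(5, 16):
    etdrs = 100 * specificity(criterion, SD_ETDRS)
    camm6 = 100 * specificity(criterion, SD_CAMM6)
    print(f"{criterion:>2} letters: ETDRS {etdrs:.0f}%, CAMM6 {camm6:.0f}%")
```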
The relative sensitivity to detect change with each method when specificity is set at 95% is shown in Table 3. The sensitivity to detect a true change of 6 letters is 32% for ETDRS chart measurements and 54% for CAMM6 measurements. The ratio between the two sensitivities is not constant across degrees of true change; the sensitivity of both methods increases with increasing true change. With CAMM6 measurements, 96% sensitivity to a true change of 11 letters might be expected. With the ETDRS chart, a true change of 15 letters is required to reach the same level of sensitivity.
Table 3. Estimated Test Sensitivity (Assuming 95% Specificity) for Differing Change Criteria
True Change (no. of ETDRS letters) | ETDRS Chart Sensitivity (%) | CAMM6 Score Sensitivity (%) |
5 | 24 | 40 |
6 | 32 | 54 |
7 | 42 | 67 |
8 | 52 | 78 |
9 | 62 | 87 |
10 | 71 | 93 |
11 | 79 | 96 |
12 | 85 | 98 |
13 | 90 | 99 |
14 | 94 | 100 |
15 | 96 | 100 |
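The sensitivities can be sketched in the same way. The sketch assumes that the change criterion is set at 1.96 times the SD of test-retest differences (95% specificity), that measured change is normally distributed around the true change with that same SD, and that only detection in the direction of the true change is counted. These are illustrative assumptions; results may differ from Table 3 by about a percentage point depending on how the SD is rounded.

```python
from statistics import NormalDist

SD_ETDRS = 3.98  # SD of test-retest differences, ETDRS chart (Table 1)
SD_CAMM6 = 2.92  # SD of test-retest differences, CAMM6 (Table 1)


def sensitivity(true_change, sd_diff, target_specificity=0.95):
    """P(measured change exceeds the criterion | true change), where the
    criterion is the two-sided cutoff giving the target specificity."""
    z_crit = NormalDist().inv_cdf(1 - (1 - target_specificity) / 2)
    criterion = z_crit * sd_diff
    measured = NormalDist(mu=true_change, sigma=sd_diff)
    return 1 - measured.cdf(criterion)


for change in range(5, 16):
    etdrs = 100 * sensitivity(change, SD_ETDRS)
    camm6 = 100 * sensitivity(change, SD_CAMM6)
    print(f"{change:>2} letters: ETDRS {etdrs:.0f}%, CAMM6 {camm6:.0f}%")
```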
The measurement error inherent in visual acuity testing is usually quantified from test-retest measurements. A 95% TRV range is calculated as 1.96 times the SD of the test-retest differences, assuming that no bias between the first and second measurements is found. This range is also sometimes referred to as a coefficient of repeatability.
15 Such 95% TRV ranges represent the range within which 95% of differences in repeated acuity measurements would be expected to lie when no change has occurred. Values between 3.5 and 18 ETDRS letters (0.07–0.36 logMAR) have been reported in different population groups.
2,3,5,7,8,10,15,16,25,28 It is clear from these reports that visual acuity measurements may be subject to very different levels of measurement error in different population groups. We are not aware of previous publications suggesting that age is a significant determinant of measurement error, and we found no evidence of an association between age and TRV in this study.
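As an illustration of the calculation described above, the following sketch computes the mean test-retest difference (bias) and the 95% TRV range from paired scores. The paired scores are hypothetical and `trv_95_range` is a name introduced here for illustration.

```python
from statistics import mean, stdev


def trv_95_range(first, second):
    """Return (bias, half_width): the mean test-retest difference and
    1.96 x the sample SD of the differences (95% TRV range = +/- half_width)."""
    differences = [b - a for a, b in zip(first, second)]
    return mean(differences), 1.96 * stdev(differences)


# Hypothetical test and retest scores (ETDRS letters) for four eyes.
test_scores = [60, 70, 80, 65]
retest_scores = [62, 69, 83, 64]
bias, half_width = trv_95_range(test_scores, retest_scores)
print(f"bias = {bias:.2f} letters, 95% TRV range = ±{half_width:.1f} letters")

# With the SD of 3.98 letters reported for the ETDRS chart in Table 1:
etdrs_half_width = 1.96 * 3.98
```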
These estimates of TRV have frequently been interpreted as defining the minimum acuity change that might reliably be detected.
9 In fact, they define a range within which 95% of tested persons with stable acuity would be expected to lie. Thus, using the limits of this range to define cutoffs with which to identify persons whose acuity has changed ensures 95% specificity of a diagnosis of change: only 5% of persons whose acuity has not changed would be misdiagnosed as having changed (2.5% in each direction).
We have previously shown that the magnitude of visual acuity change criterion that can be detected with 95% sensitivity is 1.84 times the TRV.
27 Reducing acuity test measurement error will reduce the TRV range and increase our ability to detect visual acuity changes of any given size while retaining high specificity. Inherent in this is the proviso that any change criterion should be specified based on the likely TRV in the population under scrutiny. We are only aware of two previous studies in which the sensitivity as opposed to specificity of acuity measurement has been investigated.
16,27
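The factor of 1.84 can be recovered under simple assumptions. This is an illustrative reconstruction and may differ from the derivation given in the cited work. It assumes that measured change $D$ is normally distributed around the true change $\Delta$ with SD $\sigma_d$ (the SD of test-retest differences), that the change criterion $c$ is the two-sided 95% TRV limit, and that detection is counted only in the direction of the true change.

```latex
\begin{align*}
c &= z_{0.975}\,\sigma_d = 1.96\,\sigma_d
  && \text{(95\% specificity)} \\
P(D > c \mid \Delta) &= \Phi\!\left(\frac{\Delta - c}{\sigma_d}\right) = 0.95
  && \text{(95\% sensitivity)} \\
\Rightarrow\quad \Delta &= c + z_{0.95}\,\sigma_d = (1.96 + 1.645)\,\sigma_d = 3.605\,\sigma_d \\
\frac{\Delta}{c} &= \frac{3.605}{1.96} \approx 1.84
\end{align*}
```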
The hypothesis of this study was that a process of repeating and averaging visual acuity measurements would result in reduced measurement error that, in turn, could result in quantifiable benefits in terms of test sensitivity to detect true changes of a set magnitude.
To date there is only 1 publication on the effect of repeating and averaging visual acuity measurements on TRV.
17 In this pilot study, it was found that averaging the scores of 5 or 10 computerized thresholding measurements resulted in 95% TRV ranges of ±5.5 and ±5 ETDRS letters, respectively. The 95% TRV range of ETDRS chart measurements in the studied population was ±9 letters. The computerized test device used in that study was limited to acuities of 60 ETDRS letters or better (0.50 logMAR or 20/63), and only 19 participants were included.
17
The 95% TRV range of gold standard ETDRS chart single-letter scoring measurements in our study population was ±8 letters. For comparison, when assessing the generalizability of our results, TRV values between 3.5 and 18 ETDRS letters (0.07–0.36 logMAR) have been reported in different population groups.
2,3,5,7,8,10,15,16,25,28 Patient age has not been recognized as a predictor of TRV, and we were unable to find any significant association between patient age and TRV in our study population (
Fig. 3). Use of a CAMM6 score reduced the 95% TRV range to ±5.7 letters (
Table 1). This reduction in TRV was statistically significant, with a sample size of 100 subjects (sign test,
P = 0.02). A similar improvement in TRV might conceivably be obtained by averaging the results of multiple scores obtained using either test charts or other computerized acuity measurement devices.
4,20 We conclude that a process of repeating and averaging can result in a reduced 95% TRV range of visual acuity measurements.
A need for sensitivity to detect true acuity change is common to all ophthalmic disciplines. False-positive and false-negative errors have important clinical and economic consequences by prompting unnecessary treatment or resulting in the withholding of indicated therapy. The setting of rational change criteria is a titration of the cost and clinical consequences of false-positive and false-negative errors. If 95% specificity was required in our population when using an ETDRS chart, then a sensitivity of 95% is only achieved with true changes of 15 letters or more (
Table 3). For the CAMM6 measurement, this is reduced to 10.7 letters. If 80% sensitivity is deemed acceptable, then a 12-letter change is required with an ETDRS chart compared with 8.4 letters if a CAMM6 score is used.
To put the economic implications of this result into context, the EMEA has listed a change criterion of a worsening of 6 ETDRS letters as a criterion for the retreatment of nARMD. A worsening of 5 ETDRS letters was a retreatment criterion in the PrONTO study of ranibizumab for nARMD. The selection of such acuity change criteria does not appear to have been based on known test sensitivity and specificity. The current net unit dose cost of ranibizumab in the European Union is approximately US $1200 (€865, £761.20; October 2010 exchange rates), and treatment with this drug also involves administration costs.
13 The false-positive rates in this study using change criteria of 5 and 6 letters were, respectively, 21% and 13% when an ETDRS chart was used compared with only 9% and 4% when a CAMM6 score was measured (
Table 2).
Equally important was the finding that using these same change criteria, only 24% and 32% of patients, respectively, would be expected to be correctly classified as changing when the ETDRS chart was used.
27 The sensitivity of CAMM6 measurements in this regard would be expected to be 40% and 54%, respectively (
Table 3). It can be seen from this that the use of computerized averaging of acuity scores may improve the sensitivity and specificity of acuity change classification. Clinical change criteria should also be set based on an estimate of the sensitivity and specificity of the acuity test used in classifying such change.
This reduced TRV and the consequent improvements in sensitivity for a fixed specificity were, however, obtained at a cost in terms of testing times. The mean ETDRS chart measurement time was 74 seconds (range, 21–207 seconds; median, 63 seconds), whereas a CAMM6 measurement took a mean of 234 seconds (range, 93–865 seconds; median, 198 seconds). The decision to adopt computerized averaging of acuity scores is based on a titration of the increased acuity measurement time against the potential clinical and economic benefits arising from improved test specificity and sensitivity.
There were limitations in the statistical calculations used in this study. The calculated sensitivities were determined based on the change criterion set at 95% TRV. Furthermore, modeling of sensitivity was based on the assumption that the acuity measurements represented true underlying acuities plus or minus random error, which were independent. In reality, there may be a number of nonrandom errors in clinical practice.
27
A follow-up study involving patients with confirmed visual acuity change will be performed to validate this model.
Disclosure:
N. Shah, None;
D.A.H. Laidlaw, P;
S.P. Shah, None;
S. Sivasubramaniam, None;
C. Bunce, None;
S. Cousens, None