Clinical and Epidemiologic Research  |   August 2010
The Use of Best Visual Acuity over Several Encounters as an Outcome Variable: An Analysis of Systematic Bias
Author Affiliations & Notes
  • Dara Koozekanani
    From the Department of Ophthalmology, University of Minnesota, Minneapolis, Minnesota; and
  • Douglas J. Covert
    the Department of Ophthalmology, Medical College of Wisconsin, Milwaukee, Wisconsin.
  • David V. Weinberg
    the Department of Ophthalmology, Medical College of Wisconsin, Milwaukee, Wisconsin.
  • Corresponding author: David V. Weinberg, 925 N. 87th Street, Milwaukee, WI 53226; dweinber@mcw.edu
Investigative Ophthalmology & Visual Science, August 2010, Vol. 51, 3909–3912. https://doi.org/10.1167/iovs.09-4643
Abstract

Purpose: To investigate whether the use of the best of multiple measures of visual acuity as an endpoint introduces bias into study results.

Methods: Mathematical models and Monte Carlo simulations were used. A model was designed in which a hypothetical intervention did not influence the visual acuity. The best of one or more postintervention measures was used as the outcome variable and was compared to the baseline measure. Random test–retest variability was included in the model.

Results: When the better of two postintervention measures was used as the outcome variable with a sample size of 25, the model falsely rejected the null hypothesis 55% of the time. When the best of three measures was used, the false-positive rate increased to 90%. The probability of falsely rejecting the null hypothesis increased with increasing sample size and with the number of measures from which the outcome variable was selected.

Conclusions: Using the best of multiple measures as an outcome variable introduces a systematic bias, resulting in false conclusions of improvement in that variable. The use of the best of multiple measures of visual acuity as an outcome variable should be avoided.

Visual acuity is a common outcome variable in clinical ophthalmology research. DiLoreto et al. 1 reviewed 1 year's worth of publications from three major American ophthalmology journals, looking for articles in which best and/or final visual acuity was used as an outcome measure. "Best" vision refers to the practice of selecting the maximum vision among multiple measurements. Among articles reporting visual acuity as an outcome measure, best vision was used in 3.6%; most of those were small studies. 1 Best vision is most commonly used as an outcome in retrospective studies exploring the effect of an intervention: vision measured before the intervention is used as the baseline variable, and the best of several postintervention measurements is used as the outcome variable. DiLoreto et al. 1 discouraged the use of best vision as an outcome because of the possibility of introducing bias, and indicated that this bias could lead to underestimation or overestimation of the visual outcome, depending on the trend in visual acuity change over time.
Studies have consistently shown that even under ideal testing conditions, there is test–retest variability in the measurement of visual acuity. 2–5 We hypothesized that this variability would introduce bias that could cause erroneous conclusions when using best vision as an outcome measure. We explored the nature of this bias, using both theoretical calculations and numerical simulations.
Methods
We first derived theoretical calculations of the probability distribution function produced by selecting the largest value of multiple random measurements. We assumed that the measurements had a normal distribution; repeated visual acuity measurements, at least when performed with a logMAR chart, approximate a normal distribution. 2,4,5 These calculations were then verified with a Monte Carlo technique that simulated millions of repetitions of the experiment, allowing the theoretically calculated results to be compared with the simulated results.
We also created a hypothetical experimental model in which the visual acuity was unchanged by an intervention. A single baseline acuity was measured and compared to the best of one or more acuity measures taken after the intervention.
Part I: Calculation of Probability Distribution Functions
We first present the theoretical calculations of the probability distribution functions. Let the random variables $v_1, \ldots, v_n$ represent $n$ repeated measurements of visual acuity. Assume that each measurement $v_i$ is a random selection from a normal distribution (representing the normal variability in repeated testing of visual acuity) and, for simplicity, that the mean and SD are 0 and 1, respectively. The probability distribution function (pdf) for $v_i$, denoted $p_i$, is then

$$p_i(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$

The probability that any given $v_i$ is less than or equal to some value $x$, denoted $P_i(x)$, is the cumulative distribution function:

$$P_i(x) = \int_{-\infty}^{x} p_i(t)\, dt.$$

Now consider $V_{\max}$, the maximum of the $n$ acuity measurements. The probability that $V_{\max} \le x$, denoted $P_{\max}(x)$, is the probability that every $v_i$ is less than or equal to $x$. Because the individual $v_i$ values are distributed independently, this is the product of the probabilities that each individual $v_i$ is less than or equal to $x$; because they are distributed identically, it is equivalent to this probability for any one $v_i$ raised to the $n$th power. That is,

$$P_{\max}(x) = \left[P_i(x)\right]^n.$$

$P_{\max}$ is thus the cumulative distribution function for $V_{\max}$, the maximum of the $n$ acuity measurements. We are interested in the probability distribution function for $V_{\max}$, denoted $p_{\max}$, which is simply the derivative of $P_{\max}$ with respect to $x$:

$$p_{\max}(x) = \frac{d}{dx} P_{\max}(x) = \frac{d}{dx}\left[P_i(x)\right]^n.$$

By the chain rule and the fundamental theorem of calculus,

$$p_{\max}(x) = n \left[P_i(x)\right]^{n-1} p_i(x).$$

$p_{\max}(x)$ was numerically integrated to calculate its mean and SD.
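As a concrete worked case (our addition, using a standard order-statistic result rather than anything in the original derivation): for $n = 2$, $p_{\max}(x) = 2\,P_i(x)\,p_i(x)$, and the mean has a closed form. Writing $\max(v_1, v_2) = \tfrac{1}{2}(v_1 + v_2) + \tfrac{1}{2}\lvert v_1 - v_2\rvert$ and noting that $v_1 - v_2 \sim N(0, 2)$, so that $E\lvert v_1 - v_2\rvert = 2/\sqrt{\pi}$, we have

$$E[V_{\max}] = \frac{1}{2}\,E\lvert v_1 - v_2\rvert = \frac{1}{\sqrt{\pi}} \approx 0.564.$$

That is, with only two postintervention measures, the expected "best" value already lies more than half an SD of test–retest error above the true mean, with no real change in acuity.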
To verify these theoretical results, we performed Monte Carlo simulations of the experiment just described. In a Monte Carlo simulation, a computer uses a random-number generator to run a large number of independent iterations of an experiment; the results are then tallied to measure the frequency of various outcomes. To run these simulations, we used a commercial software package (MATLAB; MathWorks, Natick, MA). Vision was treated as a continuous variable, and the test–retest error was assumed to be normally distributed with mean and SD of 0 and 1, respectively. For each iteration, a random value was drawn from the test–retest distribution and assigned as the baseline vision. We then simulated taking the best of 2, 3, or 4 subsequent independent vision measures, using the same normal distribution for test–retest variation. For each experiment, we ran 10 million iterations; from these results, we formed a numerical estimate of $p_{\max}(x)$ and calculated its mean and SD. This estimate was then compared to the theoretical calculation of $p_{\max}(x)$.
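The authors ran these simulations in MATLAB; the sketch below is a minimal Python (NumPy/SciPy) equivalent at a reduced iteration count, with variable names of our own choosing. It compares the mean and SD computed by numerical integration of the theoretical $p_{\max}(x)$ with Monte Carlo estimates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(-8.0, 8.0, 200_001)  # grid wide enough that the tails vanish

for n in (1, 2, 3, 4):
    # Theoretical pdf of the best of n draws: p_max(x) = n * P(x)^(n-1) * p(x).
    p_max = n * stats.norm.cdf(x) ** (n - 1) * stats.norm.pdf(x)
    mean_th = np.trapz(x * p_max, x)
    sd_th = np.sqrt(np.trapz((x - mean_th) ** 2 * p_max, x))

    # Monte Carlo estimate: best of n standard-normal draws, many iterations.
    v_max = rng.standard_normal((1_000_000, n)).max(axis=1)

    print(f"n={n}: theory mean={mean_th:.3f}, SD={sd_th:.3f}; "
          f"simulated mean={v_max.mean():.3f}, SD={v_max.std():.3f}")
```

For n = 2 both columns should come out near 0.564 for the mean, matching the closed-form value above.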
Part II: Simulated Experiments and Statistical Power
To evaluate the consequences of these probability distributions on experimental results, we analyzed hypothetical experimental designs. We created an experimental model in which the visual acuity was unchanged by a hypothetical intervention. A single baseline acuity was randomly selected from the normal distribution, and the outcome variable was the greatest of one or more random selections from the same distribution. We then calculated the likelihood that statistical testing would detect a significant difference between the baseline and outcome visual acuities for various sample sizes.
As in part I, we first used a theoretical approach by calculating the power of a paired t-test under our experimental design. In general, a statistical test's power is the probability that a statistically significant difference will be detected if one truly exists. For the preintervention vision, we assumed that the acuity was normally distributed with a mean of 0 and an SD of 1. For the postintervention vision, we used the probability distribution functions for the best of multiple random samples, as derived in part I. We set the level of statistical significance (type I error) at 0.05, as is common. The statistical powers were calculated by using the freely distributed software package G*Power 3, as described by Faul et al. 6
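The paper does not reproduce the G*Power inputs, but paired t-test power is conventionally computed from the noncentral t distribution. The sketch below is our own formulation under that standard approach, using the moments of $p_{\max}$ derived in part I; it approximately reproduces the theoretical row of Table 2.

```python
import numpy as np
from scipy import stats

def max_of_n_moments(n, grid=np.linspace(-8.0, 8.0, 200_001)):
    """Mean and SD of the best of n independent standard-normal measures."""
    p_max = n * stats.norm.cdf(grid) ** (n - 1) * stats.norm.pdf(grid)
    mean = np.trapz(grid * p_max, grid)
    sd = np.sqrt(np.trapz((grid - mean) ** 2 * p_max, grid))
    return mean, sd

def paired_t_power(n_subjects, n_measures, alpha=0.05):
    """Power of a two-sided paired t-test when the outcome is the best of
    n_measures draws and the baseline is one draw from the same distribution
    (i.e., the intervention has no true effect)."""
    mu, sigma = max_of_n_moments(n_measures)
    # Outcome and baseline draws are independent, so variances of the difference add.
    effect_size = mu / np.sqrt(sigma**2 + 1.0)   # Cohen's d of the paired difference
    df = n_subjects - 1
    ncp = effect_size * np.sqrt(n_subjects)      # noncentrality parameter
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)
    return (1.0 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

for k in (1, 2, 3, 4):
    print(k, [round(paired_t_power(n, k), 2) for n in (10, 25, 50)])
```

With one measure the effect size is zero, so the "power" collapses to the type I error rate of 0.05, as in the first column of Table 2.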
We then ran Monte Carlo simulations of these experimental models to evaluate the outcomes numerically. Simulations were run with a minimum of 10 million iterations for each combination of sample size and outcome sampling rate. Paired t-tests with a critical P ≤ 0.05 were used to compare the baseline and outcome visual acuities. Summary statistics from the simulations were calculated with MATLAB (MathWorks, Natick, MA) or Excel (Microsoft, Redmond, WA).
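For illustration, a compact Python version of one such simulated experiment (again our sketch, not the authors' MATLAB code, and with far fewer than 10 million iterations for speed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def rejection_rate(n_subjects, n_measures, n_iter=20_000, alpha=0.05):
    """Fraction of null experiments in which a paired t-test declares the
    best-of-n outcome significantly different from the single baseline."""
    rejections = 0
    for _ in range(n_iter):
        baseline = rng.standard_normal(n_subjects)
        outcome = rng.standard_normal((n_subjects, n_measures)).max(axis=1)
        _, p_value = stats.ttest_rel(outcome, baseline)
        rejections += p_value <= alpha
    return rejections / n_iter

print(rejection_rate(25, 2))  # roughly 0.56, in line with Table 2
```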
Results
Table 1 shows the baseline theoretical standard normal distribution, with a mean of 0 and an SD of 1, and plots of the probability distribution functions obtained from the maximum of 1, 2, 3, or 4 samples drawn from that distribution. The case of 1 sample is a control and shows that the Monte Carlo simulation gives the expected result. The distributions diverge further from the original population mean as the best of a larger number of measurements is taken. The plots from the theoretical and Monte Carlo methods were identical. The means and SDs calculated from the theoretical $p_{\max}(x)$ and from the Monte Carlo simulations are also presented; they are in close agreement.
Table 1. Probability Distributions Resulting from Choosing the Maximum of Multiple Measures of Visual Acuity
Table 2 shows the likelihood that a paired t-test would detect a difference between the pre- and postintervention populations in the hypothetical experiment described earlier. This likelihood of rejecting the null hypothesis is presented for different combinations of sample size and of the number of measures from which the best acuity was selected. We present the results of the theoretical study power calculations as well as the results of the Monte Carlo simulations. As expected, if the outcome variable is based on a single measurement, the likelihood of rejecting the null hypothesis approximates 5% when the usual definition of statistical significance is used (critical P ≤ 0.05). When the outcome variable was the best of two repeated measures and the sample size was 10, a paired t-test concluded that there was a difference 24% of the time. As the number of postintervention measures and the sample size increase, the likelihood of a false-positive outcome increases sharply.
Table 2. Probability of Rejecting the Null Hypothesis for Various Study Conditions

Measures from Which   Sample    Theoretical Study    Proportion of Simulations
Maximum Was Selected  Size (n)  Power (α = 0.05)     Rejecting the Null Hypothesis
1                     10        0.05                 0.05
1                     25        0.05                 0.06
1                     50        0.05                 0.05
2                     10        0.23                 0.24
2                     25        0.55                 0.56
2                     50        0.85                 0.87
3                     10        0.48                 0.48
3                     25        0.90                 0.91
3                     50        1.00                 1.00
4                     10        0.66                 0.65
4                     25        0.98                 0.98
4                     50        1.00                 1.00
Discussion
It is not uncommon in the literature to find studies in which a single preintervention visual acuity is compared to the best of several postintervention visual acuities. 1 However, as several studies have shown, there is variability in the repeated measurement of visual acuity, even under ideal testing conditions. 2–5 We investigated the statistical bias this practice would induce, using both theoretical and Monte Carlo models, and achieved the same results with both techniques, helping to validate our conclusions. Our models reflect a situation in which baseline vision is based on a single measure and outcome vision is based on the maximum of two or more measures from the same distribution. The systematic selection of the best of multiple measures introduces the false impression of improvement from baseline when none truly exists. This is driven by the asymmetric selection of the “best” result among multiple measures of a variable subject to test–retest variability.
In our model, the probability of rejecting the null hypothesis and erroneously concluding a treatment benefit became high, even with modest sample sizes and a low number of postintervention measures. Similarly, in a real experiment, true treatment benefits could be exaggerated. In fact, analysis of an intervention with a harmful effect on acuity could be sufficiently biased to conclude a beneficial effect. The assumptions of our model may differ from those of any specific clinical situation. However, if there is any variability in repeatedly testing vision, using the best of multiple tests will always bias toward improved vision. The likelihood of finding a false-positive outcome increases as one increases the number of subjects and the number of postintervention measures. 
We ignored patient-to-patient variability in our model for simplicity; it does not influence the conclusions of our simulations. We used a paired t-test, for which the null hypothesis is that the average change in vision for the sample is 0. This analysis first calculates the difference between the outcome and baseline vision for each subject, so baseline intersubject differences do not influence the analysis. Under these assumptions, the results of our models would be the same whether individual study subjects entered with similar or different visual acuities.
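To illustrate this point (our sketch, not from the paper): adding an arbitrary per-subject offset $t_j$ to both the baseline and all outcome draws leaves each paired difference unchanged, since $\max_i(t_j + e_{ji}) - (t_j + e_{j0}) = \max_i(e_{ji}) - e_{j0}$, so the rejection rate is unaffected.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_measures, n_iter = 25, 2, 20_000

for subject_sd in (0.0, 5.0):  # no vs. large between-subject spread
    rejections = 0
    for _ in range(n_iter):
        level = subject_sd * rng.standard_normal(n_subjects)  # per-subject true acuity
        baseline = level + rng.standard_normal(n_subjects)
        measures = level[:, None] + rng.standard_normal((n_subjects, n_measures))
        _, p_value = stats.ttest_rel(measures.max(axis=1), baseline)
        rejections += p_value <= 0.05
    print(subject_sd, rejections / n_iter)  # both rates come out near 0.56
```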
These results are not specific to visual acuity but would apply to any measure with random test–retest variability. We chose a normal distribution for ease of visualization and because logMAR visual acuities do appear to have a normal test–retest distribution. However, the derivation of $p_{\max}(x)$ did not require a normal distribution, and a similar equation could be derived for any initial probability distribution. One might observe that a paired t-test assumes normality (strictly, of the within-subject differences) and that the distribution of a best acuity measure is not normal. We do not feel this affects our overall conclusions, however. Although the probability distribution functions for the best of multiple measures are not truly normal, they probably approximate normality more closely than most clinical data sets, and the t-test is fairly robust to non-normality. Moreover, the use of a t-test reflects typical practice in the smaller studies that are more likely to use a best visual acuity measure.
This model is illustrative of the bias introduced by using the best among two or more measures of visual acuity as an outcome in clinical research; however, there are limitations. The model does not account for many factors likely to be present in real clinical data. Among these factors would be drift in vision over time due to natural history and “floor” and “ceiling” limits at the extremes of visual acuity. Thus, the magnitude (but not the direction) of the bias will vary with real-world data. 
Using the maximum of multiple measures of visual acuity as an outcome variable introduces unacceptable bias favoring the false conclusion of improvement in the variable. Authors, journal editors, and peer reviewers should be aware of this bias so that the methodology can be avoided. 
Footnotes

Supported in part by an unrestricted grant from Research to Prevent Blindness, Inc., New York, New York.

Disclosure: D. Koozekanani, None; D.J. Covert, None; D.V. Weinberg, None
References
1. DiLoreto DA, Bressler NM, Bressler SB, Schachat AP. Use of best and final visual acuity outcomes in ophthalmological research. Arch Ophthalmol. 2003;121:1586–1590.
2. Arditi A, Cagenello R. On the statistical reliability of letter-chart visual acuity measurements. Invest Ophthalmol Vis Sci. 1993;34:120–129.
3. Blackhurst DW, Maguire MG; The Macular Photocoagulation Study Group. Reproducibility of refraction and visual acuity measurement under a standard protocol. Retina. 1989;9:163–169.
4. Lovie-Kitchin JE. Validity and reliability of visual acuity measurements. Ophthalmic Physiol Opt. 1988;8:363–370.
5. Siderov J, Tiu AL. Variability of measurements of visual acuity in a large eye clinic. Acta Ophthalmol Scand. 1999;77:673–676.
6. Faul F, Erdfelder E, Lang AG, Buchner A. G*Power 3: a flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav Res Methods. 2007;39:175–191.