**Purpose.**:
To investigate whether the use of the best of multiple measures of visual acuity as an endpoint introduces bias into study results.

**Methods.**:
Mathematical models and Monte Carlo simulations were used. A model was designed in which a hypothetical intervention did not influence the visual acuity. The best of one or more postintervention measures was used as the outcome variable and was compared to the baseline measure. Random test–retest variability was included in the model.

**Results.**:
When the better of two postintervention measures was used as the outcome variable with a sample size of 25, the model falsely rejected the null hypothesis 55% of the time. When the best of three measures was used, the false-positive rate increased to 90%. The probability of falsely rejecting the null hypothesis increased both with increasing sample size and with the number of measures from which the outcome variable was selected.

**Conclusions.**:
Using the best of multiple measures as an outcome variable introduces a systematic bias, resulting in false conclusions of improvement in that variable. The use of the best of multiple measures of visual acuity as an outcome variable should be avoided.

DiLoreto et al.^{ 1 } reviewed 1 year's worth of publications from three major American ophthalmology journals, looking for articles in which best and/or final visual acuity was used as an outcome measure. "Best" vision refers to the practice of selecting the maximum vision among multiple measurements. Among articles reporting visual acuity as an outcome measure, best vision was used in 3.6%; most of those were small studies.

Best vision is most commonly used as an outcome in retrospective studies exploring the effect of an intervention. Vision measured before the intervention is used as the baseline variable, and the best of several postintervention measurements is used as the outcome variable. DiLoreto et al.^{ 1 } discouraged the use of best vision as an outcome because of the possibility of introducing bias, and indicated that this bias could lead to underestimation or overestimation of the visual outcome, depending on the trend in visual acuity change over time.

Several studies have shown that there is variability in the repeated measurement of visual acuity, even under ideal testing conditions.^{ 2–5 } We hypothesized that this variability would introduce bias that could cause erroneous conclusions when best vision is used as an outcome measure. We explored the nature of this bias, using both theoretical calculations and numerical simulations.

Random test–retest variability, consistent with published measurements of visual acuity variability, was included in the model.^{ 2,4,5 } These calculations were then verified by using a Monte Carlo technique to simulate repetition of this experiment millions of times, and the theoretically calculated results were compared with the simulated results.

Let $v_1, \ldots, v_n$ represent $n$ repeated measurements of visual acuity. Assume each measurement $v_i$ is a random selection from a normal distribution (representing the normal variability in repeat testing of visual acuity). For simplicity, assume the mean and SD are 0 and 1, respectively. The probability distribution function (pdf) for $v_i$, defined as $p_i$, is:

$$p_i(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$

Now consider $V_{\max}$, the maximum of the $n$ acuity measurements. The probability that $V_{\max} \le x$, denoted $P_{\max}(x)$, is the probability that $v_i$ is less than $x$ for every $v_i$. If the individual $v_i$ values are distributed independently, then it is the product of the probabilities that each individual $v_i$ is less than $x$. Because they are distributed identically, it is equivalent to this probability for any $v_i$ raised to the $n$th power. That is,

$$P_{\max}(x) = \left[\int_{-\infty}^{x} p_i(t)\,dt\right]^n.$$

$P_{\max}$ is the cumulative distribution function for $V_{\max}$, the maximum of $n$ acuity measurements. We are interested in the probability distribution function for $V_{\max}$, denoted $p_{\max}$. This function is simply the derivative of $P_{\max}$ with respect to $x$. By the fundamental theorem of calculus, we have:

$$p_{\max}(x) = \frac{d}{dx}P_{\max}(x) = n\,p_i(x)\left[\int_{-\infty}^{x} p_i(t)\,dt\right]^{n-1}.$$

$p_{\max}(x)$ was numerically integrated to calculate its mean and SD.

A Monte Carlo simulation was used to generate an empirical estimate of $p_{\max}(x)$ and to calculate its mean and SD. This estimate was then compared to the theoretical calculation of $p_{\max}(x)$.
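The derivation above can be checked numerically. The sketch below is an illustration assuming Python with SciPy (the original work used MatLab and Excel); it evaluates $p_{\max}(x) = n\,p_i(x)\,\Phi(x)^{n-1}$, where $\Phi$ is the standard normal CDF, and integrates it to obtain the mean and SD of the best of $n$ measures:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def p_max(x, n):
    # pdf of the maximum of n i.i.d. standard normal measurements:
    # n * phi(x) * Phi(x)**(n - 1)
    return n * stats.norm.pdf(x) * stats.norm.cdf(x) ** (n - 1)

def max_mean_sd(n):
    # numerically integrate x*p_max and x^2*p_max to get the mean and SD
    mean, _ = quad(lambda x: x * p_max(x, n), -np.inf, np.inf)
    second, _ = quad(lambda x: x * x * p_max(x, n), -np.inf, np.inf)
    return mean, np.sqrt(second - mean ** 2)

for n in (1, 2, 3, 4):
    m, s = max_mean_sd(n)
    print(f"best of {n}: mean = {m:.3f}, SD = {s:.3f}")
```

For a single measure this recovers mean 0 and SD 1; for the best of two it gives a mean of roughly 0.56 SD units, which is the spurious "improvement" that the selection procedure builds in under the null.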

The statistical power of a paired *t*-test was then calculated using our experimental design. In general, a statistical test's power is the probability that a statistically significant difference will be detected if one truly exists. For the preintervention vision, we assumed that the acuity was normally distributed with a mean of 0 and an SD of 1. For the postintervention vision, we used the probability distribution functions for the best of multiple random samples, as derived in part I. We set the level of statistical significance (type I error) at 0.05, as is common. The statistical powers were calculated by using the freely distributed software package G*Power 3, as described by Faul et al.^{ 6 }

Paired *t*-tests with a critical *P* ≤ 0.05 were used to compare the baseline and outcome visual acuities. Summary statistics from the simulations were calculated (MatLab; MathWorks Inc., Natick, MA; or Excel; Microsoft Inc., Redmond, WA).
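The simulation loop can be sketched as follows. This is an illustration assuming Python with NumPy/SciPy rather than the MatLab/Excel tooling named above, and `false_positive_rate` is a hypothetical helper name. Each simulated study draws one baseline measure per subject and the best of k postintervention measures from the same null distribution, then applies a paired t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def false_positive_rate(n_subjects, k_measures, n_trials=2000, alpha=0.05):
    """Fraction of simulated studies in which a paired t-test (P <= alpha)
    'detects' a change when the intervention has no true effect."""
    rejections = 0
    for _ in range(n_trials):
        baseline = rng.standard_normal(n_subjects)  # single pre measure
        # outcome = best (maximum) of k post measures per subject
        post = rng.standard_normal((n_subjects, k_measures)).max(axis=1)
        _, p = stats.ttest_rel(post, baseline)
        rejections += p <= alpha
    return rejections / n_trials

print(false_positive_rate(25, 2))  # roughly matches the 0.55 reported for best of 2, n = 25
```

With `k_measures=1` the rate stays near the nominal 5%; it climbs steeply as either the number of measures or the sample size grows.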

The theoretical calculations of $p_{\max}(x)$ and the Monte Carlo simulations are also presented; they are in close agreement.

We then calculated the likelihood that a paired *t*-test would detect a difference between the pre- and postintervention populations in the hypothetical experiment described earlier. This likelihood of rejecting the null hypothesis is presented for different combinations of sample size and number of measures from which the best acuity was selected. We present the results from the theoretical study power calculations, as well as the results of the Monte Carlo simulations. As expected, if the outcome variable is based on a single measurement, the likelihood of rejecting the null hypothesis approximates 5% when the usual definition of statistical significance is used (critical *P* ≤ 0.05). When the outcome variable was the best of two repeated measures and the sample population was 10, a paired *t*-test concluded that there was a difference 24% of the time. As the number of repeated measures and the sample size increase, the likelihood of a false-positive outcome increases sharply.

| Number of Measures from Which Maximum Value Was Selected | Sample Size (n) | Theoretical Study Power | Proportion of Simulations Rejecting the Null Hypothesis |
|---|---|---|---|
| 1 | 10 | 0.05 | 0.05 |
| 1 | 25 | 0.05 | 0.06 |
| 1 | 50 | 0.05 | 0.05 |
| 2 | 10 | 0.23 | 0.24 |
| 2 | 25 | 0.55 | 0.56 |
| 2 | 50 | 0.85 | 0.87 |
| 3 | 10 | 0.48 | 0.48 |
| 3 | 25 | 0.90 | 0.91 |
| 3 | 50 | 1.00 | 1.00 |
| 4 | 10 | 0.66 | 0.65 |
| 4 | 25 | 0.98 | 0.98 |
| 4 | 50 | 1.00 | 1.00 |

A paired *t*-test with a critical *P* < 0.05 was used to determine statistical significance.
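The theoretical powers in the table can be reproduced without G*Power. The sketch below (an illustration assuming Python with SciPy; `paired_t_power` is a hypothetical helper name) treats the per-subject change as the difference between the best-of-k distribution (mean and SD from part I) and an independent standard normal baseline, computes the resulting effect size, and evaluates the noncentral t distribution:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def max_moments(k):
    # mean and SD of the maximum of k i.i.d. standard normal measurements
    pdf = lambda x: k * stats.norm.pdf(x) * stats.norm.cdf(x) ** (k - 1)
    m, _ = quad(lambda x: x * pdf(x), -np.inf, np.inf)
    s2, _ = quad(lambda x: x * x * pdf(x), -np.inf, np.inf)
    return m, np.sqrt(s2 - m ** 2)

def paired_t_power(k, n, alpha=0.05):
    mu, sd = max_moments(k)
    dz = mu / np.sqrt(1.0 + sd ** 2)  # effect size of the paired difference
    df = n - 1
    ncp = dz * np.sqrt(n)             # noncentrality parameter
    tcrit = stats.t.ppf(1 - alpha / 2, df)
    # two-sided rejection probability under the noncentral t distribution
    return stats.nct.sf(tcrit, df, ncp) + stats.nct.cdf(-tcrit, df, ncp)

print(f"best of 2, n = 25: power = {paired_t_power(2, 25):.2f}")  # cf. 0.55 in the table
```

Note that this calculation, like the G*Power one, approximates the difference distribution as normal; the close match to the Monte Carlo column suggests the approximation is adequate here.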

However, as several studies have shown, there is variability in the repeated measurement of visual acuity, even under ideal testing conditions.^{ 2–5 } We investigated the statistical bias this practice would induce, using both theoretical and Monte Carlo models. We achieved the same results with both techniques, helping to validate our conclusions. Our models reflect a situation in which baseline vision is based on a single measure, and outcome vision is based on the maximum of two or more measures from the same distribution. The systematic selection of the best of multiple measures introduces the false impression of improvement from baseline when none truly exists. This is driven by the asymmetric selection of the "best" result among multiple measures of a variable subject to test–retest variability.

A paired *t*-test was used in our model. The null hypothesis is that the average change in vision for the sample is 0. This analysis first calculates the difference between the outcome and baseline vision for each subject. Since the difference in acuity is calculated for each subject, baseline intersubject differences do not influence the analysis. Under these assumptions, the results of our models would be the same whether individual study subjects entered with similar or different visual acuities.
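This invariance can be demonstrated directly (a sketch under the same assumptions as above: standard normal test–retest variability, NumPy/SciPy). Adding an arbitrary per-subject baseline offset to both the pre and post measures leaves the paired t statistic unchanged, because the offset cancels in each per-subject difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 25
offsets = rng.uniform(-3, 3, n)                 # per-subject 'true' acuity levels
baseline = rng.standard_normal(n)               # single pre measure
post = rng.standard_normal((n, 2)).max(axis=1)  # best of two post measures

t_same, _ = stats.ttest_rel(post, baseline)
t_shifted, _ = stats.ttest_rel(post + offsets, baseline + offsets)
print(np.isclose(t_same, t_shifted))  # True: offsets cancel in the differences
```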

Our derivation of $p_{\max}(x)$ did not require a normal distribution, and a similar equation could be derived for any initial probability distribution. One might observe that a paired *t*-test requires both test populations to have normal distributions, and the distribution of a best acuity measure is not normal. We do not believe this affects our overall conclusions, however. Although the probability distribution functions for the best of multiple measures are not truly normal, they probably approximate normality more closely than most clinical data sets, and the *t*-test is fairly robust to non-normality. Moreover, the use of a *t*-test reflects typical practice in the smaller studies that are more likely to use a best visual acuity measure.