**Purpose.**
The probability of type I error, or a false-positive result, increases as the number of statistical comparisons in a study increases. Statisticians have developed numerous corrections to account for the multiple comparison problem. This study discusses recent guidelines involving multiple comparison corrections, calculates the prevalence of corrections in ophthalmic research, and estimates the corresponding number of false-positive results reported at a recent international research meeting.

**Methods.**
The 6415 abstracts presented at ARVO 2010 were searched for statistical comparisons (*P* values) and for use of multiple comparison corrections. Studies that reported five or more *P* values while reporting no correction factor were used in a simulation study. The simulation study was conducted to estimate the number of false-positive results reported in these studies.

**Results.**
Overall, 36% of abstracts reported *P* values and 1.2% of abstracts used some form of correction. Whereas 8% of abstracts reported at least five *P* values, only 5% of these used a multiple comparison correction. In these highly statistical studies, simulations resulted in 185 false-positive outcomes found in 30% of abstracts.

**Conclusions.**
The paucity of multiple comparison corrections in ophthalmic research results in inflated type I error and may produce unwarranted shifts in clinical or surgical care. Researchers must make a conscious effort to decide if and when to use a correction factor to ensure the validity of the data.

^{1–3}Prior to beginning an analysis, researchers must agree on an acceptable type I error rate, or alpha level. When more than one significance test is performed in a study, the type I error rate for each individual test remains equal to the alpha level; however, the probability of obtaining at least one false-positive result in the study as a whole increases. This is known as the multiple comparison problem or the multiple testing problem. To illustrate this phenomenon, consider a standard coin flip. Each time a coin is flipped there is a 50% chance of the coin landing on “heads.” Now consider flipping the same coin 10 times. Each individual flip still results in a 50% chance of “heads.” However, the probability of obtaining at least one “head” among all 10 coin flips is much larger than 50%. The same phenomenon occurs when 10 separate tests are performed using an alpha level of 0.05; the probability of obtaining at least one false-positive result out of all 10 individual tests is larger than 5%.

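The family-wise error rate (FWER), the probability of obtaining at least one false-positive result among all comparisons, at an alpha level of 0.05 is

$$\mathrm{FWER} = 1 - 0.95^{n},$$

where *n* is the total number of comparisons made in a study. This equation is represented graphically in Figure 1, which demonstrates how FWER is related to the number of comparisons in a study and the predetermined alpha level.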
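The relationship can be checked numerically. The following is a minimal sketch, not code from the study: the function name `fwer` is hypothetical, and the formula assumes that the comparisons are independent and that every null hypothesis is true.

```python
def fwer(n_comparisons: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n independent tests.

    Assumes independent comparisons and that every null hypothesis is true.
    """
    if n_comparisons < 0:
        raise ValueError("n_comparisons must be non-negative")
    return 1.0 - (1.0 - alpha) ** n_comparisons


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"{n:>2} comparisons: FWER = {fwer(n):.3f}")
    # Prints 0.050, 0.226, 0.401, and 0.642, respectively.
```

With 5 comparisons the FWER is already about 23%, and with 10 comparisons about 40%, which are the thresholds used later in the analysis.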

**Figure 1.**


^{4–12}Although there is still ongoing discussion about the more esoteric points of the argument, many researchers and international organizations agree that multiplicity corrections must be used to rein in type I error.

^{13–24}In fact, a number of recent studies have identified the lack of multiple comparison corrections to be the underlying cause of unwarranted shifts in clinical care paradigms.

^{13,25,26}Although reported guidelines for use vary, most sources agree that: (1) the multiple comparisons problem should not be ignored or type I error inflation can occur; (2) the best way to address the problem is to limit the number of comparisons; (3) the rationale for or against using a correction factor should be discussed before data analysis is undertaken and should be properly documented; and (4) corrections are strongly encouraged when separate comparisons are related or when a study is confirmatory in nature. The difference between “exploratory” and “confirmatory” analysis, often described as inductive versus deductive research, is not clear-cut and must also be discussed during study design.

^{27–29}However, an analysis of the prevalence of multiple comparison corrections in ophthalmic research and its implications has not been addressed. In this study the prevalence of multiplicity corrections in ophthalmic research is estimated using abstracts at an international research conference. The analysis focuses on studies that report large numbers of statistical comparisons, because these represent research where multiple comparison corrections would need to be considered. Simulation techniques are used to estimate the number of type I errors reported in these statistically rigorous abstracts.

**Table 1.**

| Section | Total # of Abstracts | # Reporting P Values | % Reporting P Values | Max # P Values Reported | Median # of P Values (where reported) | # of Abstracts Reporting >5 P Values | # of Abstracts Reporting >10 P Values |
|---|---|---|---|---|---|---|---|
| Anatomy | 225 | 68 | 30% | 50 | 2 | 12 | 2 |
| Biochemistry | 562 | 124 | 22% | 20 | 3 | 33 | 10 |
| Clinical epidemiology | 348 | 169 | 49% | 14 | 3 | 49 | 5 |
| Cornea | 878 | 321 | 37% | 1,000,000 | 3 | 56 | 8 |
| Eye movements | 272 | 97 | 36% | 29 | 3 | 19 | 6 |
| Genetics | 48 | 6 | 13% | 100,000 | 2.5 | 2 | 2 |
| Glaucoma | 752 | 459 | 61% | 100 | 3 | 135 | 31 |
| Immunology | 362 | 88 | 24% | 12 | 2 | 11 | 2 |
| Lens | 240 | 46 | 19% | 12 | 3 | 11 | 2 |
| Multidisciplinary | 177 | 51 | 29% | 21 | 2 | 10 | 1 |
| Nanotechnology | 25 | 6 | 24% | 2 | 1 | 0 | 0 |
| Physiology | 312 | 112 | 36% | 24 | 3 | 25 | 6 |
| Retina | 1076 | 463 | 43% | 20 | 2 | 112 | 14 |
| Retinal cell biology | 600 | 163 | 27% | 100,000 | 2 | 27 | 2 |
| Visual neurology | 265 | 48 | 18% | 10 | 3 | 8 | 1 |
| Visual psychology | 273 | 100 | 37% | 9 | 2.5 | 28 | 0 |
| Total | 6415 | 2321 | 36% | 1,000,000 | 3 | 538 | 92 |

*P* values. The PDF document was searched for the terms “*P* value,” “*P*,” “*P*,” “*P*,” and all spatial variations of the same. All abstracts were also searched for the most common multiple comparison correction methods using the terms “Bonferroni,” “Scheffe,” “Tukey,” “Duncan,” “Dunnett,” “Newman-Keuls,” “Sidak,” “Least Significant Difference,” “False Discovery Rate,” as well as the general terms “multiple comparison” and “multiplicity.” The search was automated, highlighting all the terms listed above. After the automated search was complete, two of the authors (AS and SP) and two assistants conducted a manual review of the search results, assessed the results for validity, and recorded two variables for each abstract: the number of reported *P* values and whether a correction factor was used.
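An automated term search of this kind might be sketched as follows. This is an illustration only, not the tool used in the study: the function name `flag_abstract` is hypothetical, and the regular expression for *P* value mentions (a "P" followed by a comparison sign or the word "value") is an assumption, because the exact search variants are not recoverable from the text. As in the study, any such automated pass would still need manual review.

```python
import re

# Correction-method terms listed in the Methods.
CORRECTION_TERMS = [
    "Bonferroni", "Scheffe", "Tukey", "Duncan", "Dunnett", "Newman-Keuls",
    "Sidak", "Least Significant Difference", "False Discovery Rate",
    "multiple comparison", "multiplicity",
]

# ASSUMPTION: a P value mention is "P" followed by =, <, >, a Unicode
# inequality sign, or the word "value(s)". The study's exact variants are unknown.
P_VALUE_PATTERN = re.compile(r"\bP\s*(?:[=<>≤≥]|values?\b)", re.IGNORECASE)
CORRECTION_PATTERN = re.compile(
    "|".join(re.escape(term) for term in CORRECTION_TERMS), re.IGNORECASE
)


def flag_abstract(text: str) -> dict:
    """Count candidate P value mentions and list correction terms found."""
    found_terms = {m.group(0).lower() for m in CORRECTION_PATTERN.finditer(text)}
    return {
        "p_value_mentions": len(P_VALUE_PATTERN.findall(text)),
        "correction_terms": sorted(found_terms),
    }


if __name__ == "__main__":
    sample = ("IOP was lower (P = 0.03) but CCT was not (P>0.5). "
              "A Bonferroni correction was applied.")
    print(flag_abstract(sample))
```

The counts produced this way are only candidates; a reviewer would still have to confirm each match and decide whether a correction was actually applied.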

Abstracts reporting 5 or more *P* values (FWER of 23% or greater) and 10 or more *P* values (FWER of 40% or greater) were analyzed for their use of a correction factor. If a correction factor was not mentioned, the abstracts were used in a simulation study. The goal of the simulation study was to estimate the number of type I errors expected in these statistically rigorous studies. Criteria for inclusion in the simulation were 5 or more reported *P* values and no reported correction factor. For each abstract that met inclusion criteria, a binomial distribution was used to simulate the number of type I errors reported in the abstract, using the number of reported *P* values as the “number of observations” parameter and an assumed alpha level of 0.05 as the “success” parameter. The simulation parameters can be written as $Y_i \sim \mathrm{BINOMIAL}(n_i, p)$, where $n_i$ is the number of reported *P* values in the *i*th abstract, *p* equals the alpha level (0.05) or the probability of type I error, and $Y_i$ is the resulting number of simulated type I errors in the *i*th abstract. Because the truth of the null hypothesis was unknown in all cases, the null hypothesis was assumed to be true for all statistical comparisons. One simulation was complete when the resulting number of type I errors for each abstract was estimated using the above distribution. At the end of one simulation, results were recorded, including the total number of type I errors in all studies, the number of simulated studies with type I errors, and the number of simulated studies with more than one type I error. This process was repeated 10,000 times and the average results were calculated. Separate simulation studies were carried out for all abstracts with 5 or more *P* values and for all abstracts with 10 or more *P* values. The simulation study was completed using the R software (GNU Project) statistical package (provided in the public domain by the R Foundation for Statistical Computing, Vienna, Austria, available at http://www.r-project.org/).^{30}
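The simulation procedure can be sketched in a few lines. The study itself used R; the Python below is only an illustrative re-expression under stated assumptions, with a hypothetical function name (`simulate_type_i_errors`) and hypothetical per-abstract *P* value counts, since the per-abstract data are not given in the text. Each binomial draw is built from independent Bernoulli trials using the standard library.

```python
import random


def simulate_type_i_errors(p_value_counts, alpha=0.05, n_sims=10_000, seed=0):
    """Average simulated type I error summaries over n_sims repetitions.

    p_value_counts: number of reported P values (n_i) for each abstract.
    Every null hypothesis is assumed true, so each comparison is a false
    positive with probability alpha, i.e. Y_i ~ Binomial(n_i, alpha).
    """
    rng = random.Random(seed)
    total_errors = 0
    studies_with_error = 0
    studies_with_multiple = 0
    for _ in range(n_sims):
        for n_i in p_value_counts:
            # One binomial draw, built from n_i independent Bernoulli trials.
            y_i = sum(rng.random() < alpha for _ in range(n_i))
            total_errors += y_i
            studies_with_error += y_i >= 1
            studies_with_multiple += y_i >= 2
    return {
        "mean_total_errors": total_errors / n_sims,
        "mean_studies_with_error": studies_with_error / n_sims,
        "mean_studies_with_multiple_errors": studies_with_multiple / n_sims,
    }


if __name__ == "__main__":
    # HYPOTHETICAL counts for six abstracts; not data from the study.
    example_counts = [5, 6, 7, 10, 12, 44]
    print(simulate_type_i_errors(example_counts, n_sims=2_000))
```

Dividing `mean_studies_with_error` by the number of abstracts gives the share of abstracts expected to contain at least one type I error, the quantity reported as a percentage in the Results.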

*P* values in these abstracts, separated by category. A total of 36% of all abstracts (2321) reported statistical comparisons in the form of *P* values. Researchers in glaucoma registered the highest percentage of abstracts reporting *P* values (61%), whereas those in genetics reported the lowest percentage (13%). Overall, 23% (538) of those abstracts that reported *P* values presented >5 and 4% (92) reported >10 *P* values.

*P* values. The most common correction method used was a Bonferroni correction, which represented 32% of all corrections. Researchers also used Tukey's (28%), False Discovery Rate (7%), Least Significant Difference (5%), Dunnett's (4%), Scheffe's (3%), and Newman-Keuls (3%) methods. A nonspecific multiple comparison test was used in 13 (18%) abstracts. Neither the Duncan nor the Sidak method was used. The abstracts within the genetics section, which reported the lowest prevalence of *P* values, demonstrated the most proficient use of multiple comparison corrections, with 8.3% of all genetics abstracts reporting some form of correction. Of the abstracts that reported at least 5 *P* values, only 5% (27 of 538) reported a correction factor. In the 511 abstracts with at least 5 *P* values and no correction factor, there were a total of 3703 reported *P* values (per-abstract mean = 7.2, median = 6, max = 44). Of the abstracts that reported at least 10 *P* values, only 13% (12 of 92) reported a correction factor. In the 80 abstracts with at least 10 *P* values and no correction factor, there were a total of 1054 reported *P* values (per-abstract mean = 13.2, median = 11, max = 44).

**Table 2.**

| Section | Bonferroni | Tukey | False Discovery Rate | Least Significant Difference | Dunnett | Scheffe | Newman-Keuls | Multiple Comparison NOS | Total | % of All Abstracts | % of All Abstracts Reporting P Values |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Anatomy | - | - | 2 | - | - | - | - | 2 | 4 | 1.8% | 5.9% |
| Biochemistry | 2 | - | 1 | - | - | - | 1 | 1 | 5 | 0.9% | 4% |
| Clinical epidemiology | - | 1 | - | - | - | - | - | - | 1 | 0.3% | 0.6% |
| Cornea | 7 | 1 | - | 2 | - | 1 | - | 4 | 15 | 1.7% | 4.7% |
| Eye movements | 1 | 3 | - | - | - | - | - | 1 | 5 | 1.8% | 5.2% |
| Genetics | 3 | - | - | - | - | - | - | 1 | 4 | 8.3% | 66.7% |
| Glaucoma | 4 | 1 | 1 | 2 | 1 | 1 | 1 | 3 | 14 | 1.9% | 3.1% |
| Immunology | - | - | - | - | - | - | - | 1 | 1 | 0.3% | 1.1% |
| Lens | - | 3 | - | - | - | - | - | - | 3 | 1.3% | 6.5% |
| Multidisciplinary | 1 | - | - | - | - | - | - | - | 1 | 0.6% | 2% |
| Nanotechnology | - | - | - | - | - | - | - | - | 0 | 0% | 0% |
| Physiology | - | 2 | - | - | 1 | - | - | - | 3 | 1% | 2.7% |
| Retina | 3 | 4 | - | - | - | - | - | - | 7 | 0.7% | 1.5% |
| Retinal cell biology | 1 | 1 | - | - | - | - | - | - | 2 | 0.3% | 1.2% |
| Visual neurology | - | 1 | 1 | - | 1 | - | - | - | 3 | 1.1% | 6.3% |
| Visual psychology | 2 | 4 | - | - | - | - | - | - | 6 | 2.2% | 6% |
| Total | 24 | 21 | 5 | 4 | 3 | 2 | 2 | 13 | 74 | 1.2% | 3.2% |

A total of 511 abstracts met criteria for the simulation study involving 5 or more *P* values. A total of 80 abstracts met criteria for the simulation study involving 10 or more *P* values. The characteristics of the studies that met inclusion criteria and the results of the simulation study are displayed in Table 3. The simulation study resulted in a false-positive outcome in an average of 30% (154 of 511) of abstracts reporting 5 or more *P* values and in nearly half (48%, 38 of 80) of abstracts reporting 10 or more *P* values. In addition, multiple type I errors were found in an average of 5.2% of studies with 5 or more comparisons and 14% of studies with 10 or more.

**Table 3.**

The first three columns give abstract characteristics; the remaining four give simulation results.

| # of Reported P Values | # Abstracts Meeting Criteria | Total # of P Values in Included Studies | Average # of Simulated Type I Errors | Average # of Simulated Studies with a Type I Error | % of Simulated Studies with a Type I Error | % of Studies with Multiple Type I Errors |
|---|---|---|---|---|---|---|
| 5 or more | 511 | 3703 | 185.3 | 154.2 | 30.2% | 5.2% |
| 10 or more | 80 | 1054 | 52.7 | 38.2 | 47.7% | 14.0% |

*P* values are displayed. The number of type I errors in each of these abstracts was simulated 10,000 times and the average results are reported: the total number of type I errors in all included abstracts, the number and percentage of abstracts that reported a type I error, and the percentage of studies that reported more than one error.
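The simulated totals can be cross-checked analytically, because the expected number of type I errors under a binomial model is the alpha level multiplied by the total number of comparisons. The snippet below is a sketch of that arithmetic only; the function name `expected_type_i_errors` is hypothetical, and the totals are those reported in Table 3.

```python
ALPHA = 0.05

# Totals reported in Table 3: (total P values, average simulated type I errors).
TABLE_3 = {
    "5 or more": (3703, 185.3),
    "10 or more": (1054, 52.7),
}


def expected_type_i_errors(total_p_values: int, alpha: float = ALPHA) -> float:
    """Expected type I errors when every null is true: alpha * sum of n_i."""
    return alpha * total_p_values


if __name__ == "__main__":
    for group, (total, simulated) in TABLE_3.items():
        expected = expected_type_i_errors(total)
        print(f"{group}: expected {expected:.2f}, simulated {simulated}")
    # Prints expected values of 185.15 and 52.70.
```

The expected values agree closely with the simulated averages. The expected number of abstracts containing at least one error cannot be checked the same way here, because it depends on the per-abstract counts, which are not listed.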

*P* values were likely part of exploratory analyses where an inflated FWER is acceptable, a fact that would need to be addressed a priori and cannot be discerned by any a posteriori analysis. Nevertheless, 1.2% is a very low estimate, especially when compared with other literature reviews, in which the proportion of studies reporting correction factors is often 40% and as high as 60%.^{31}

*P* values is expected to be an underestimate of the total number of statistical comparisons conducted.^{11} Inevitably, there are numerous statistical comparisons that were conducted but not reported, whether due to a nonsignificant result, space constraints, or other reasons. Evidence of this is that a small number of abstracts reported the use of a correction factor but reported no *P* values. The results of this and any similar analysis, therefore, will underestimate the number of statistical comparisons conducted by researchers, which leads to an underestimate of the type I error rates and an underestimate of the need for multiple comparison corrections.

^{13,23}Any retrospective analysis of studies without a priori knowledge of inherent correlations in statistical comparisons or knowledge of whether the study is “exploratory” versus “confirmatory” in nature would lead to arbitrary results, at best. Additionally, it should be noted that if a study has numerous, unrelated comparisons that are “exploratory” in nature, this does not decrease the FWER of the study; it only makes an elevated type I error rate more acceptable. For example, suppose a study uses one data set to conduct 10 related comparisons, whereas another study uses 10 different data sets to conduct 10 different comparisons. Both studies conduct 10 comparisons, which at an alpha level of 0.05 results in a type I error rate (FWER) of 40%. Although the elevated type I error rate is less desirable in the study that uses one data set because the comparisons are related, the probability of type I error remains identical between the two studies. The approach used in this analysis of identifying statistically rigorous studies with high FWER provides the best available sample of studies where a correction factor would need to be considered.

*P* values. It would be illogical and inconsistent to report a corrected *P* value without reporting the new alpha level; such a practice would leave the reader unable to interpret any results. We assumed, therefore, that if specific *P* values were reported and a correction was conducted, it was mentioned in the text. Space and time constraints in conference abstracts may lead to increased publication bias, but this would only result in an underestimate of the type I error rate. The simulation analysis in this study resulted in nearly one third of all statistically rigorous studies reporting a type I error. Although these numbers may underestimate the error rate, they illustrate very well the need for more liberal use of multiple comparison corrections in ophthalmic research.
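To make the idea of a "new alpha level" concrete, the sketch below computes per-comparison alpha levels under two of the correction methods named in the search terms, Bonferroni and Sidak. It is a generic textbook illustration, not a procedure taken from any abstract in the study; the function names are hypothetical, and the FWER calculation assumes independent comparisons.

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison alpha under a Bonferroni correction: alpha / n."""
    return alpha / n_comparisons


def sidak_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison alpha under a Sidak correction: 1 - (1 - alpha)^(1/n)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_comparisons)


def fwer_at(per_test_alpha: float, n_comparisons: int) -> float:
    """FWER for n independent comparisons, each tested at per_test_alpha."""
    return 1.0 - (1.0 - per_test_alpha) ** n_comparisons


if __name__ == "__main__":
    n, alpha = 10, 0.05
    for name, adjusted in (("Bonferroni", bonferroni_alpha(alpha, n)),
                           ("Sidak", sidak_alpha(alpha, n))):
        print(f"{name}: per-test alpha = {adjusted:.5f}, "
              f"FWER = {fwer_at(adjusted, n):.4f}")
```

Under either method, each of 10 comparisons is judged against an alpha level of roughly 0.005 rather than 0.05, which is why a corrected result cannot be interpreted unless the adjusted level is reported alongside it.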

*J Am Stat Assoc.* 1955;50:1096–1121.

*N Engl J Med.* 1985;313:1450–1456.

*Stat Med.* 1991;10:871–889; discussion 889–890.

*Am J Epidemiol.* 1995;142:904–908.

*BMJ.* 1998;316:1236–1238.

*J Clin Epidemiol.* 2001;54:343–349.

*Am J Epidemiol.* 1998;147:807–812; discussion 815.

*Am J Epidemiol.* 1997;145:84–85.

*Epidemiology.* 1990;1:43–46.

*Am J Epidemiol.* 1998;147:801–806.

*Int J Epidemiol.* 2008;37:430–434.

*Clin Trials.* 2005;2:394–399.

*Control Clin Trials.* 2000;21:527–539.

*Health Serv Res.* 2006;41:804–818.

*How to Report Statistics in Medicine: Annotated Guidelines for Authors, Editors, and Reviewers*. 2nd ed. Philadelphia: American College of Physicians Press; 2006: xxii, 490.

*Epidemiol Rev.* 2002;24:26–38.

*Adv Physiol Educ.* 2004;28:85–87.

*Can J Anesth.* 2011;58:668–696.

*BMC Med Res Methodol.* 2001;1: Art. 2.

*JAMA.* 2001;285:1987–1991.

*Technical Methods Report: Guidelines for Multiple Testing in Impact Evaluations.* Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education; 2008.

*Clin Cancer Res.* 2008;14:4368–4371.

*Stat Med.* 1995;14:1659–1682.

*Clin Chim Acta.* 2005;361:128–134.

*Med Hypotheses.* 2005;65:395–399.

*J Cataract Refract Surg.* 2004;30:2005–2006.

*J Cataract Refract Surg.* 2004;30:2207–2208.

*Invest Ophthalmol Vis Sci.* 2011;52:6059–6065.

*R: A Language and Environment for Statistical Computing.* Vienna, Austria; 2011.

*Am J Physiol Regul Integr Comp Physiol.* 2000;279:R1–R8.