We read with great interest the article, “An analysis of the use of multiple comparison corrections in ophthalmology research.”1 The authors reviewed abstracts from the 2012 ARVO Meeting to estimate the number of abstracts reporting false-positive results or requiring consideration of multiple comparison corrections. We would like to comment on the authors' methodology and to provide a counterpoint to their presentation of multiple comparison corrections.
Several limitations to the methodology are worth noting. First, the words “significant” and “statistically significant” were not included in the list of search terms. It is quite possible for an abstract to report statistically significant results yet contain no P values; such an abstract presumably would not have been captured by the authors' search (Gallagher MJ, et al. IOVS 2010;51:ARVO E-Abstract 4555). Second, it also is quite possible that an abstract makes no mention of a multiple comparisons correction even though the full paper does (Bousquet E, et al. IOVS 2011;52:ARVO E-Abstract 6582; Lavery W, et al. IOVS 2010;51:ARVO E-Abstract 5036; Millard LH, et al. IOVS 2010;51:ARVO E-Abstract 151).2–4 Given the space constraints of a conference abstract, it would not be unusual for this information to be omitted. Thus, the authors likely have underestimated the number of analyses that corrected for multiple comparisons. Furthermore, they define abstracts reporting five or more P values as those needing a correction factor; however, they provide no scientific explanation for this criterion, nor do they apply it consistently in their own scientific research.5,6
The theoretical basis behind adjusting for multiple comparisons is relatively straightforward. However, as noted above, we wish to provide some practical counterpoints to what the authors imply is a straightforward and universally accepted process.
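To make that theoretical basis concrete, a standard textbook calculation (stated here under the simplifying assumption of m independent tests, each performed at level α) gives the family-wise probability of at least one false positive:

```latex
\[
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{m}
\]
```

For example, m = 5 independent tests at α = 0.05 give 1 − 0.95^5 ≈ 0.23, roughly a one-in-four chance of at least one spurious “significant” result; it is this inflation that the corrections aim to control.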
First and foremost, the process of enumerating the relevant number of statistical tests is vital and has implications for the proper interpretation of the corrected results.7–9 However, a number of important and often unanswered questions arise in this regard. What about tests that were performed, but not published? What about tests planned for future studies? What about other tests the investigator has done in her or his lifetime? This lack of consistency creates some practical problems. For example, imagine two published studies on the same topic, both adjusted for multiple comparisons. In theory, if the method of adjustment (e.g., Bonferroni, Tukey, Newman-Keuls), the critical value used as the acceptable alpha level, or the number of other tests performed differed, the same result would be interpreted differently between the two studies, as the sketch below illustrates. This defies common sense. As such, the lack of a gold-standard approach for multiple test adjustment makes comparing results from one study to another very difficult.
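A minimal sketch of this inconsistency, assuming a Bonferroni adjustment; the observed P value and the two family sizes below are invented purely for illustration:

```python
# Hypothetical illustration: the same observed P value is judged
# differently depending on how many tests each study counts in its
# "family" -- the enumeration problem described above.
observed_p = 0.02  # identical result reported by both studies

for n_tests in (2, 5):  # Study A counts 2 comparisons; Study B counts 5
    threshold = 0.05 / n_tests  # Bonferroni-adjusted per-test alpha
    verdict = "significant" if observed_p < threshold else "not significant"
    print(f"family of {n_tests} tests: threshold {threshold:.3f} -> {verdict}")
```

The identical result is declared significant by one study and nonsignificant by the other solely because the two studies enumerated their test families differently.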
The authors suggest that the paucity of multiple comparison corrections may produce unwarranted shifts in clinical or surgical care. Adjusting for multiple comparisons can cut both ways, however, as it results in a higher type II error rate by artificially setting the bar for statistical significance very high.8,10 While the adjusted threshold is more conservative than the conventional P < 0.05, it can keep researchers from detecting a true association and exploring it further (the simulation sketched below illustrates this loss of power).8,10 In terms of clinical care, this means standard practices could continue to be used when, in fact, a shift is warranted. In addition, the authors provide no evidence that failure to adjust has led to any shift in clinical care. Indeed, common sense dictates that changes in clinical care will not be based on the results of a single paper, corrected for multiple comparisons or not; changes in clinical care happen in an evolutionary manner. It is our belief that the lack of a standardized approach for multiple test adjustment is more likely to hamper shifts in clinical care than is concern for the number of false-positive results, as the authors suggest.
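As a minimal sketch of that type II error trade-off; every setting here (per-group sample size, true effect size, family size, and seed) is a hypothetical choice, not a value from the article:

```python
# Monte Carlo illustration of the power cost of a Bonferroni adjustment.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, effect, n_sims = 30, 0.5, 5000  # per-group n; true difference in SD units
alpha, family = 0.05, 10           # nominal alpha; tests counted in the family

hits_raw = hits_bonf = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, n)      # control group
    b = rng.normal(effect, 1.0, n)   # group with a genuine effect
    p = ttest_ind(a, b).pvalue
    hits_raw += p < alpha            # unadjusted test detects the effect
    hits_bonf += p < alpha / family  # Bonferroni-adjusted threshold

print(f"power, unadjusted: {hits_raw / n_sims:.2f}")
print(f"power, Bonferroni: {hits_bonf / n_sims:.2f}")
# With these settings the Bonferroni-adjusted analysis misses the true
# effect far more often -- the type II error trade-off noted above.
```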
The recommended approach when dealing with multiple tests is to make use of the effect size in addition to the P value. In addition, authors should point out any methodologic concerns relevant to a quality explanation of an association, such as control selection, nonresponse, recall bias related to disease occurrence, interviewer bias, confounding, and prior evidence.8 This should enable the reader to reach a reasonable conclusion without the help of adjustments.
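A minimal sketch of that recommendation, reporting Cohen's d (one common effect-size measure) alongside the P value; the data below are simulated stand-ins:

```python
# Hypothetical sketch: report an effect size (Cohen's d, pooled SD)
# alongside the P value so readers can judge the magnitude of an
# association as well as its statistical significance.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
control = rng.normal(0.0, 1.0, 40)  # invented measurements
treated = rng.normal(0.6, 1.0, 40)

t_stat, p_value = ttest_ind(control, treated)
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd

print(f"P = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")
```

Reported together, the effect size conveys the magnitude of the association while the P value conveys its statistical stability, giving the reader both pieces needed to judge the finding without recourse to adjustment.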