**Purpose**:
To provide information and perspectives on statistical significance and on meta-analysis, a statistical procedure for combining estimated effects across multiple studies.

**Methods**:
Methods are presented for performing a meta-analysis, in which results across multiple studies are combined. An example is provided of a meta-analysis of retinal thickness measured by optical coherence tomography in patients with multiple sclerosis across multiple studies. We show how to combine individual study results and how to weight the results of each study based on its reliability. Meta-analysis derives from all study results a pooled estimate that is closest to the unknown common effect.

**Results**:
Differences between the two most common methods for meta-analysis, the fixed-effects approach and the random-effects approach, are reviewed. Meta-analysis is applied to the study of the differences in the thickness of the retinal nerve fiber layers of healthy controls and patients with multiple sclerosis, showing why this is a useful procedure for combining estimated effects across multiple studies to derive the magnitude of retinal thinning caused by multiple sclerosis.

**Conclusions**:
This review provides information and perspectives on statistical significance and on meta-analysis, a statistical procedure for combining estimated effects across multiple studies. A discussion is provided to show why statistical significance and low probability values are not all that matter and why investigators should also look at the magnitude of the estimated effects. Combining estimated effects across multiple studies with proper weighting of individual results is the goal of meta-analysis.

- Institutional pressure to be funded and published may lead to intentional or unintentional "chasing of probability values" and to the conclusion that results are significant, even though a fresh, unbiased view of the evidence may reveal otherwise. False-positive conclusions may result from the inappropriate treatment of outliers, the use of incorrect statistical analysis methods, ignoring violations of assumptions (e.g., non-normality of data, unequal variances, and missing values) that are critically relevant to the adopted methods, or chance alone. In the end, this can result in a "looking away" from facts that may contradict what one wants to see.
- Publication bias: only statistically significant results tend to get published. This bias excludes studies that have been underpowered from the start, with little chance of detecting meaningful effects. The result is that small studies are published only when their results are statistically significant, which will happen 5% of the time even when there are no differences (assuming an alpha level of 0.05 for statistical testing).
- Statistical significance and low probability values are not all that matter. One must also look at the magnitude of the estimated effects. Cohen's d (Cohen^{3}) relates the difference of two group means to the pooled standard deviation; that is, it relates the size of the effect to the standard deviation of individual measurements. General rules of thumb consider a Cohen's d of 0.2 a small effect, 0.5 a medium-sized effect, and 0.8 a large effect. Cohen's d supplements the results of inferential testing and provides perspective on meaningful effects. Statistical significance does not amount to much if the magnitude of the estimated effect is not scientifically or clinically relevant; one must not confuse the statistical significance of estimated effects with their practical significance. Probability values alone do not tell the complete story, as even small and meaningless effects can be identified as significant with a large sample size.
- Repeated positive results, even though not statistically significant in each individual study, can add up to significant findings if results are combined through meta-analysis, a method discussed in the following section, Borrowing Strength: Combining Results From Different Studies Through Meta-Analysis. However, meta-analysis is compromised when nonsignificant studies remain unpublished, because then only the positive studies are combined, biasing the result toward significance.
- Confidence intervals are preferable to probability values. Confidence intervals express both the magnitude of the estimated effect and the uncertainty of the estimate; the uncertainty gives perspective on how likely the results are to be repeatable.
- Probability values reported on a continuous scale are preferable to a binary (no/yes) report of statistical significance at an arbitrarily chosen criterion level (e.g., *P* ≤ 0.05). A result with a probability value of 0.105 is not all that different from one with a probability value of 0.095, or even 0.047, especially if the dataset is small and there is uncertainty whether all assumptions that went into the statistical test that generated the probability value were actually satisfied.
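The effect-size guideline above can be made concrete with a short sketch. The group summaries below are hypothetical (they are not taken from any study in the Table); the function simply implements the definition of Cohen's d as the difference of two group means divided by the pooled standard deviation:

```python
from math import sqrt

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Cohen's d: difference of group means divided by the pooled SD."""
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical example: controls (mean 100, SD 10, n=40) vs. patients (mean 92, SD 11, n=45)
d = cohens_d(100.0, 92.0, 10.0, 11.0, 40, 45)
# d is about 0.76, a medium-to-large effect by the rule of thumb above
```

Under the conventional guidelines, a d near 0.76 would be called a medium-to-large effect regardless of whether the associated probability value crosses 0.05.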

In a special issue of *The American Statistician* (see Wasserstein et al.^{4}), the American Statistical Association recommends against the abusive use of probability values; the lead editorial suggests abandoning the use of the term "statistically significant" altogether.

Petzold et al.^{5} conducted a comprehensive review of published studies that addressed this issue, and they used meta-analysis methods to combine the results of all relevant studies. Our illustration here compares the thickness of the peripapillary retinal nerve fiber layer in eyes of healthy controls with eyes of patients having MS, but without a history of optic neuritis. The Table summarizes the results of the 18 studies identified in the review by Petzold et al.^{5}

The estimated study effect (the difference of the two group means), \(y_i\), is reported in (boldface) column 8 of the Table. A negative difference (effect) indicates that the mean retinal nerve fiber layer thickness of MS patients is smaller than that of healthy controls. The standard error of the estimated effect, \(\hat \sigma = \sqrt {(s_{MS}^2/{n_{MS}}) + (s_{HC}^2/{n_{HC}})} \), is shown in (boldface) column 9. It is derived from the standard deviations and the sample sizes of the two groups. The limits of the 95% confidence interval for a study's mean effect are given in columns 10 and 11. We use percentiles of the standard normal distribution for calculating the confidence intervals. Alternatively, one can use percentiles of the t-distribution, with degrees of freedom given by the Welch approximation; see Ledolter et al.^{6} However, this makes little difference because the sample sizes are fairly large. The t-ratio for the mean difference is shown in column 12. The two-sided probability value, testing whether or not the mean difference is significant, is shown in column 13; three of the 18 studies (with *P* values greater than 0.05, in boldface) were not statistically significant.
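The per-study quantities in columns 8 through 13 can be reproduced directly from the group summaries. A minimal sketch, using hypothetical group means, standard deviations, and sample sizes rather than values from the Table:

```python
from math import sqrt

def study_effect_stats(mean_ms, sd_ms, n_ms, mean_hc, sd_hc, n_hc, z=1.96):
    """Effect (MS minus healthy controls), its standard error,
    normal-theory 95% confidence limits, and the t-ratio."""
    y = mean_ms - mean_hc                         # negative => thinner RNFL in MS
    se = sqrt(sd_ms**2 / n_ms + sd_hc**2 / n_hc)  # column-9 standard error
    return y, se, y - z * se, y + z * se, y / se

# Hypothetical study: MS eyes mean 92.0 (SD 11) microns, n=50;
# healthy controls mean 100.0 (SD 10) microns, n=60
y, se, lo, hi, t = study_effect_stats(92.0, 11.0, 50, 100.0, 10.0, 60)
```

Because the interval (lo, hi) excludes zero, this hypothetical study would be declared statistically significant at the 0.05 level.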

The fixed-effects model assumes a common treatment effect and pools the estimated study effects, \(y_i\), in column 8 of the Table. The sample variance of an estimated study effect, denoted by \(\hat \sigma _i^2\), reflects the reliability of the estimate; it is obtained by squaring the standard error explained above and shown in column 9 of the Table. The reciprocal of this variance, \(\hat \sigma _i^{ - 2}\), represents the weight that is attached to the \(i\)th study effect, so that reliable effects (less variability in the effect across the patients studied) contribute to the weighted average more than unreliable ones (more variability in the effect across patients). The weights \(\hat \sigma _i^{ - 2}\) and the normalized weights \({w_i} = \frac{{\hat \sigma _i^{ - 2}}}{{\sum {\hat \sigma _i^{ - 2}} }}\) are shown in the last two columns of the Table. The weighted (pooled) average of the \(n\) (here \(n = 18\)) estimated study effects, \(\bar y = \sum\nolimits_{i = 1}^n {{w_i}{y_i}} \), is the generalized least squares estimate of the common effect, with standard error \({( {\sum\nolimits_{i = 1}^n {\hat \sigma _i^{ - 2}} })^{ - 1/2}}\); see, for example, Abraham and Ledolter^{7} (page 128).

The fixed-effects meta-analysis results in *effect*_{pooled} = −7.70 microns and *se*(*effect*_{pooled}) = 0.378 microns. The 95% confidence interval, −7.70 ± (1.96)(0.378), extends from −8.44 to −6.96. The probability value for testing the hypothesis that the common effect is zero is less than 0.0001.
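The inverse-variance pooling described above can be sketched in a few lines. The three study effects and standard errors below are hypothetical stand-ins, not values from the Table:

```python
from math import sqrt

def fixed_effects_pool(effects, ses):
    """Fixed-effects (inverse-variance) pooled estimate, its standard error,
    and the normalized weights w_i."""
    inv_var = [1.0 / s**2 for s in ses]           # reciprocal sample variances
    total = sum(inv_var)
    weights = [iv / total for iv in inv_var]      # normalized weights sum to 1
    pooled = sum(w * y for w, y in zip(weights, effects))
    se_pooled = 1.0 / sqrt(total)                 # GLS standard error
    return pooled, se_pooled, weights

# Hypothetical three-study example (effects in microns, with standard errors)
pooled, se, w = fixed_effects_pool([-8.0, -7.0, -6.5], [1.0, 2.0, 0.5])
```

Note how the third study, with the smallest standard error, dominates the pooled estimate; this is exactly the weighting by reliability described in the text.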

The random-effects model assumes that the estimated effect of the \(i\)th study, \(y_i\), is distributed as \({y_i}|{\mu _i}\sim N({\mu _i},\sigma _i^2)\), where \({\mu _i}\) is the true underlying treatment effect of the \(i\)th study and \(\sigma _i^2\) is the corresponding within-study variance. This variance is estimated by \(\hat \sigma _i^2\), the squared entry of column 9 of the Table. The random-effects model further assumes that \({\mu _i}\sim N(\mu ,{\tau ^2})\), where \(\mu\) and \({\tau ^2}\) denote the overall treatment effect and the between-study variance, respectively. These two distributions imply the marginal distribution \({y_i}\sim N(\mu ,\sigma _i^2 + {\tau ^2})\).
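The two-stage structure of the random-effects model, and in particular the claim that the marginal variance is the sum of the within-study and between-study variances, can be checked by simulation. The parameter values below are arbitrary illustrations, not estimates from the Table:

```python
import random

random.seed(1)

# Hypothetical parameters: overall effect mu, between-study SD tau, within-study SD sigma
mu, tau, sigma = -7.5, 2.0, 1.5

# Two-stage draws: mu_i ~ N(mu, tau^2), then y_i | mu_i ~ N(mu_i, sigma^2)
n = 100_000
draws = [random.gauss(random.gauss(mu, tau), sigma) for _ in range(n)]

# Empirical mean and variance of the marginal distribution of y_i
mean = sum(draws) / n
var = sum((y - mean) ** 2 for y in draws) / (n - 1)
# var should be close to sigma^2 + tau^2 = 1.5^2 + 2.0^2 = 6.25
```

The empirical variance lands near 6.25, matching the marginal distribution \(N(\mu, \sigma^2 + \tau^2)\) stated above.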

Approaches to random-effects meta-analysis differ in how the between-study variance \({\tau ^2}\) gets estimated. The DerSimonian and Laird^{8} estimate, \(\hat \tau _{DL}^2\), is commonly used. DerSimonian and Laird use the Q statistic, \(Q = \sum\nolimits_{i = 1}^n {\sigma _i^{ - 2}} {({y_i} - \bar y)^2}\), where \(\bar y = {{\sum\nolimits_{i = 1}^n {\sigma _i^{ - 2}} {y_i}} \big/ {\sum\nolimits_{i = 1}^n {\sigma _i^{ - 2}} }} = \sum\nolimits_{i = 1}^n {{w_i}} {y_i}\) is the pooled estimate under the fixed-effects model. Under the assumptions of the random-effects model, the expectation of \(Q\) is \(E(Q) = (n - 1) + ({S_1} - \frac{{{S_2}}}{{{S_1}}}){\tau ^2}\), where \({S_1} = \sum\nolimits_{i = 1}^n {\sigma _i^{ - 2}} \) and \({S_2} = \sum\nolimits_{i = 1}^n {\sigma _i^{ - 4}} \). Replacing all unknown variances \(\sigma _i^2\) with their estimates \(\hat \sigma _i^2\) and solving the equation \(Q = E(Q)\) for \({\tau ^2}\) leads to the DerSimonian and Laird estimate \(\hat \tau _{DL}^2 = \max (0,\frac{{Q - (n - 1)}} {{{S_1} - \frac{{{S_2}}}{{{S_1}}}}})\). With this estimate, the generalized least squares estimate of the common treatment effect \(\mu\) is given by the weighted average with weights proportional to \({(\hat \sigma _i^2 + \hat \tau _{DL}^2)^{ - 1}}\).

For our example, the random-effects meta-analysis results in *effect*_{DL} = −7.41 microns, with a standard error *se*(*effect*_{DL}) = 0.805 microns. The approximate 95% confidence interval, −7.41 ± (1.96)(0.805), extends from −8.98 to −5.83, and the probability value for testing the hypothesis that the common effect is zero is less than 0.0001.
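The DerSimonian and Laird computation described above can be sketched as follows. The three-study input is hypothetical and chosen so that the between-study variance is clearly nonzero:

```python
def dersimonian_laird(effects, ses):
    """DerSimonian-Laird between-study variance tau^2 and the
    resulting random-effects pooled estimate and standard error."""
    n = len(effects)
    inv_var = [1.0 / s**2 for s in ses]
    s1 = sum(inv_var)                              # S_1 = sum of sigma_i^-2
    s2 = sum(iv**2 for iv in inv_var)              # S_2 = sum of sigma_i^-4
    y_bar = sum(iv * y for iv, y in zip(inv_var, effects)) / s1  # fixed-effects pool
    q = sum(iv * (y - y_bar)**2 for iv, y in zip(inv_var, effects))
    tau2 = max(0.0, (q - (n - 1)) / (s1 - s2 / s1))
    # GLS weights now also include the between-study variance
    w_star = [1.0 / (s**2 + tau2) for s in ses]
    mu_dl = sum(w * y for w, y in zip(w_star, effects)) / sum(w_star)
    se_dl = sum(w_star) ** -0.5
    return tau2, mu_dl, se_dl

# Hypothetical studies with equal within-study SEs but widely spread effects
tau2, mu_dl, se_dl = dersimonian_laird([-10.0, -4.0, -7.0], [1.0, 1.0, 1.0])
```

With equal within-study standard errors the pooled estimate reduces to the simple average, but the large spread of the effects inflates \(\hat \tau _{DL}^2\) and hence the random-effects standard error, mirroring the widening seen in the example above.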

_{DL}*effect*= −7.70 and

_{pooled}*effect*= −7.41, are about the same. However, because of the large between-study variability, the standard error from the random-effects model,

_{DL}*se*(

*effect*) = 0.805, is twice as large as the standard error from the fixed-effects model

_{DL}*se*(

*effect*) = 0.378. However, the conclusions are unchanged; either method of meta-analysis confirms highly significant differences in the mean retinal thickness of healthy controls and patients with MS.

The average of the studies' pooled standard deviations of individual measurements, *s* = 10.63 microns, is insensitive to the weights being used (the equally weighted average is 10.87; the weighted average with weights from the random-effects model is 10.53). The standard deviation *s* = 10.63 leads to a Cohen's d in the medium-to-large category (7.70/10.63 = 0.72 and 7.41/10.63 = 0.70).
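The effect-size arithmetic above, using the pooled results reported in this example, is simply:

```python
# Effect sizes from the pooled meta-analysis results reported above
s = 10.63            # average within-study standard deviation (microns)
d_fixed = 7.70 / s   # fixed-effects pooled effect divided by s
d_random = 7.41 / s  # DerSimonian-Laird pooled effect divided by s
# Both values fall between the 0.5 (medium) and 0.8 (large) benchmarks
```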

The R library meta^{9} can be used to carry out the analysis and visualize its results. The forest plot in the Figure displays the estimated study effects as squares and their confidence intervals as horizontal lines. The area of each square reflects the precision of the treatment estimate. The vertical line through zero represents the hypothesis of no effect; confidence intervals of individual studies that overlap this line indicate effects that do not differ significantly from zero. The weights for the fixed- and random-effects meta-analyses and their resulting pooled estimates, visualized by vertical lines and labeled diamonds, are also shown. In this example, the differences between the two pooled estimates, and hence between the two vertical lines, are minor. The name forest plot refers to the forest of lines the display produces. GraphPad Prism 8 can also be used to produce forest plots.

**J. Ledolter**, None; **R.H. Kardon**, None

**References**

1. *JAMA*. 2005; 294: 218–228.
2. *PLoS Med*. 2005; 2: e124.
3. Cohen J. *Statistical Power Analysis for the Behavioral Sciences*. New York: Routledge; 1988.
4. Wasserstein et al. *Am Stat*. 2019; 73: 1–19.
5. Petzold et al. *Lancet Neurol*. 2017; 16: 797–812.
6. Ledolter et al. *Invest Ophthalmol Vis Sci*. 2020; 61: 11.
7. Abraham B, Ledolter J. *Introduction to Regression Modeling*. Belmont, CA: Duxbury Press; 2006: 128.
8. DerSimonian R, Laird N. *Control Clin Trials*. 1986; 7: 177–188.
9. *Library meta*. Vienna, Austria: R Foundation for Statistical Computing. Available at: www.R-project.org/.