Focus on Data  |   August 2020
Focus on Data: Statistical Significance, Effect Size and the Accumulation of Evidence Achieved by Combining Study Results Through Meta-analysis
Author Affiliations & Notes
  • Johannes Ledolter
    Department of Business Analytics, Tippie College of Business, University of Iowa, Iowa City, IA, United States
    Center for the Prevention and Treatment of Visual Loss, Iowa City VA Health Care System, Iowa City, IA, United States
  • Randy H. Kardon
    Department of Ophthalmology and Visual Sciences, University of Iowa, Iowa City, IA, United States
    Center for the Prevention and Treatment of Visual Loss, Iowa City VA Health Care System, Iowa City, IA, United States
  • Correspondence: Johannes Ledolter, Department of Business Analytics, Tippie College of Business, University of Iowa, Iowa City, IA, 52242, USA; johannes-ledolter@uiowa.edu
Investigative Ophthalmology & Visual Science August 2020, Vol.61, 32. doi:https://doi.org/10.1167/iovs.61.10.32
      Johannes Ledolter, Randy H. Kardon; Focus on Data: Statistical Significance, Effect Size and the Accumulation of Evidence Achieved by Combining Study Results Through Meta-analysis. Invest. Ophthalmol. Vis. Sci. 2020;61(10):32. doi: https://doi.org/10.1167/iovs.61.10.32.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

Purpose: To provide information and perspectives on statistical significance and on meta-analysis, a statistical procedure for combining estimated effects across multiple studies.

Methods: Methods are presented for performing a meta-analysis in which results across multiple studies are combined. An example of a meta-analysis of optical coherence tomography thickness of the retina in patients with multiple sclerosis across multiple studies is provided. We show how to combine individual study results and how to weight the results of each study based on its reliability. The method of a meta-analysis is used to derive from all study results a pooled estimate that is closest to the unknown common effect.

Results: Differences between the two most common methods for meta-analysis, the fixed-effects approach and the random-effects approach, are reviewed. Meta-analysis is applied to the study of the differences in the thickness of the retinal nerve fiber layers of healthy controls and patients with multiple sclerosis, showing why this is a useful procedure for combining estimated effects across multiple studies to derive the magnitude of retinal thinning caused by multiple sclerosis.

Conclusions: This review provides information and perspectives on statistical significance and on meta-analysis, a statistical procedure for combining estimated effects across multiple studies. A discussion is provided to show why statistical significance and low probability values are not all that matter and why investigators should also look at the magnitude of the estimated effects. Combining estimated effects across multiple studies with proper weighting of individual results is the goal of meta-analysis.

Statistical Significance Is Not All That Matters
In 2005, Ioannidis1,2 wrote several influential articles suggesting that up to 50% of biomedical studies are not reproducible. Why is this so? Various explanations can be given to support this claim; among them are the following. 
  • Institutional pressure to be funded and published may lead to intentional or unintentional “chasing of probability values” and to concluding that results are significant, even though a fresh, unbiased view of the evidence may reveal otherwise. False-positive conclusions may result from the inappropriate treatment of outliers, the use of incorrect statistical analysis methods, ignoring violations of assumptions (e.g., non-normality of data, unequal variances, and missing values) that are critically relevant to the adopted methods, or chance alone. In the end, this can result in “looking away” from facts that contradict what one wants to see.
  • Publication bias. Only statistically significant results tend to get published. This bias excludes studies that have been underpowered from the start, with little chance of detecting meaningful effects. The result is the publishing of small studies only when the results are statistically significant, which will happen 5% of the time even when there are no differences (assuming an alpha level of 0.05 for statistical testing).
  • Statistical significance and low probability values are not all that matter. One must also look at the magnitude of the estimated effects. Cohen's d (Cohen3) relates the difference of two group means to the pooled standard deviation; that is, it expresses the size of the effect in units of the standard deviation of individual measurements. Rule-of-thumb guidelines consider a Cohen's d of 0.2 a small effect, 0.5 a medium-sized effect, and 0.8 a large effect. Cohen's d supplements the results of inferential testing and provides perspective on meaningful effects. Statistical significance does not amount to much if the magnitude of the estimated effect is not scientifically or clinically relevant. One must not confuse the statistical significance of estimated effects with their practical significance. Probability values alone do not tell the complete story, as even small and meaningless effects can be identified as significant with a large sample size.
  • Repeated positive results, even though not statistically significant in each individual study, can add up to significant findings if results are combined through meta-analysis, a method that is discussed in the following section, Borrowing Strength: Combining Results From Different Studies Through Meta-Analysis. However, meta-analysis is compromised when nonsignificant studies remain unpublished, because then only the positive studies are combined, biasing the result toward significance.
  • Confidence intervals are preferable to probability values. Confidence intervals express both the magnitude of the estimated effect and the uncertainty of the estimate. The uncertainty of the estimate gives perspective on how likely it is that the results are repeatable.
  • Probability values reported on a continuous scale are preferable to a binary (no/yes) report of statistical significance at an arbitrarily chosen criterion level (e.g., P ≤ 0.05). A result with a probability value of 0.105 is not all that different from one with a probability value of 0.095 or even 0.047, especially if the dataset is small and if there is uncertainty whether all assumptions that went into the statistical test that generated the probability value were actually satisfied.
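The effect-size calculation described above can be sketched in a few lines of code. The article's own analyses were carried out in R; the Python function below is an illustrative translation, and the sample data in the usage note are hypothetical.

```python
import math

def cohens_d(x, y):
    """Cohen's d: difference of two group means divided by the pooled SD."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # Unbiased sample variances of each group.
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    # Pooled standard deviation across the two groups.
    s_pooled = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / s_pooled
```

For example, `cohens_d([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])` returns about −1.26, a large effect by the rule-of-thumb guidelines above, regardless of whether the group sizes would make the difference statistically significant.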
In a 2019 special issue of The American Statistician (see Wasserstein et al.4), the American Statistical Association recommends against abusive use of probability values; the lead editorial suggests abandoning the use of the term “statistically significant” altogether. 
Borrowing Strength: Combining Results From Different Studies Through Meta-Analysis
A meta-analysis is a statistical procedure for combining estimated effects across multiple studies. Each individual study reports an estimated effect together with its standard error and confidence interval. The aim of a meta-analysis is to derive a pooled estimate that is closest to the unknown common effect. 
Although there are many different methods for meta-analysis, with each version making slightly different assumptions, all existing methods yield a weighted average of individual study results. The difference is in the way these weights and the uncertainty (confidence interval) of the resulting weighted estimate are calculated. 
A meta-analysis assumes that the results of multiple studies are independent. Studies that are combined should include different patients, and must not just reanalyze the same experimental data. Independence implies that results of one study have no bearing on the results of the other studies. 
Example
Retinal imaging biomarkers are important for early recognition and monitoring of inflammation and neurodegeneration in multiple sclerosis (MS). With the introduction of spectral domain optical coherence tomography, measurements on the thickness of retinal nerve fiber layers are now readily available, but it is important to know which retinal layers show atrophy associated with neurodegeneration in MS. Petzold et al.5 conducted a comprehensive review of published studies that addressed this issue and they used meta-analysis methods to combine the results of all relevant studies. Our illustration here compares the thickness of the peripapillary retinal nerve fiber layer in eyes of healthy controls with eyes of patients having MS, but without a history of optic neuritis. The Table summarizes the results of the 18 studies identified in the review by Petzold et al.5 
Table.
 
Results of 18 Studies Assessing the Difference in Mean Thickness (in Microns) of the Peripapillary Retinal Nerve Fiber Layer in Eyes of Healthy Control Patients and in Eyes of Patients With MS Without Optic Neuritis
The Table reports for each study the number of enrolled participants and the mean and standard deviation of the retinal nerve fiber layer thickness (in microns), for both the MS group and the group of healthy controls. The difference of the two group means, y, is reported in (bold face) column 8 of the Table. A negative difference (effect) indicates that the mean retinal nerve fiber layer thickness of MS patients is smaller than that of healthy controls. The standard error of the estimated effect, \(\hat \sigma = \sqrt {(s_{MS}^2/{n_{MS}}) + (s_{HC}^2/{n_{HC}})} \), is shown in (bold face) column 9. It is derived from the standard deviations and the sample sizes of the two groups. The limits of the 95% confidence interval for a study's mean effect are given in columns 10 and 11. We use percentiles of the standard normal distribution for calculating the confidence intervals. Alternatively, one can use percentiles of the t-distribution, with degrees of freedom given by the Welch approximation; see Ledolter et al.6 However, this makes little difference as the sample sizes are fairly large. The t-ratio for the mean difference is shown in column 12. The two-sided probability value, testing whether or not the mean difference is significant, is shown in column 13; three of the 18 studies (with P values greater than 0.05 in bold face) were considered not statistically significant. 
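The per-study quantities in columns 8 through 12 of the Table follow directly from the two group summaries. As a sketch (in Python rather than the R used for the published analysis, and with hypothetical input values in the usage note), using the normal percentile 1.96 as in the article:

```python
import math

def study_summary(m_ms, s_ms, n_ms, m_hc, s_hc, n_hc):
    """Effect, standard error, 95% CI (normal percentiles), and t-ratio
    for the difference of two group means."""
    y = m_ms - m_hc                        # negative => thinner layer in MS
    se = math.sqrt(s_ms**2 / n_ms + s_hc**2 / n_hc)
    lo, hi = y - 1.96 * se, y + 1.96 * se  # 95% confidence limits
    return y, se, (lo, hi), y / se
```

With the hypothetical summaries `study_summary(90, 12, 100, 98, 10, 100)`, the effect is −8 microns with a standard error of about 1.56 microns, so the 95% confidence interval excludes zero.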
We use this example to illustrate the basic concepts behind a meta-analysis. The investigator must make certain choices on methods (described elsewhere in this article) when carrying out a meta-analysis, and these choices can affect the results. Also, it is important to establish objective criteria for including studies, because the results of a meta-analysis depend on which studies are included. 
Method 1: The Fixed-Effects Model
The fixed-effects model calculates a weighted average of the reported estimated study effects, yi, in column 8 of the Table. The sample variance of an estimated study effect, denoted by \(\hat \sigma _i^2\), reflects the reliability of the estimate; it is obtained by squaring the standard error explained above and shown in column 9 of the Table. The reciprocal of this variance, \(\hat \sigma _i^{ - 2}\), represents the weight that is attached to the ith study effect, so that reliable effects (less variability in the effect across the patients studied) contribute to the weighted average more than unreliable ones (more variability in the effect across patients). The weights \(\hat \sigma _i^{ - 2}\) and the normalized weights \({w_i} = \frac{{\hat \sigma _i^{ - 2}}}{{\sum {\hat \sigma _i^{ - 2}} }}\) are shown in the last two columns of the Table. The weighted (pooled) average of the n (here n = 18) estimated study effects  
\begin{equation*} {\textit{effect}_{pooled}} = \sum\nolimits_{i = 1}^n {{w_i}} {y_i}\end{equation*}
is the estimate of the unknown common treatment effect. The standard error of the pooled estimate is given by  
\begin{equation*}se({\textit{effect}_{pooled}}) = \sqrt {\frac{1}{{\sum {\hat \sigma _i^{ - 2}} }}} .\end{equation*}
 
These are the generalized least squares estimates of a population mean and its standard error when observations have unequal variances \(\hat \sigma _i^2\); see Abraham and Ledolter7 (page 128). 
For the example in the Table, effectpooled = −7.70 microns and se(effectpooled) = 0.378 microns. The 95% confidence interval of −7.70 ± (1.96)(0.378) extends from −8.44 to −6.96. The probability value for testing the hypothesis whether or not the common effect is zero is less than 0.0001. 
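The fixed-effects computation above amounts to an inverse-variance weighted average. A minimal sketch in Python (the published analysis used the R library meta; the inputs in the usage note are illustrative, not the Table's values):

```python
import math

def fixed_effects(effects, ses):
    """Inverse-variance (fixed-effects) pooled estimate and its standard error."""
    w = [1.0 / se**2 for se in ses]                      # weights 1/sigma_i^2
    s1 = sum(w)
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / s1
    return pooled, math.sqrt(1.0 / s1)
```

For two illustrative studies with effects −8 and −6 and standard errors 1 and 2, the more precise first study dominates: the pooled estimate is −7.6 with standard error about 0.894.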
Method 2: The Random-Effects Model
The fixed-effects model assumes that all included studies are drawn from the same population. This assumption is unrealistic because the studies are heterogeneous and treatment or disease effects differ owing to diverse measurement devices and algorithms, and local study conditions including genetic and environmental influences on the populations being studied. The random-effects model relaxes this assumption, which makes it a more realistic model in most situations. 
The random-effects model assumes that the treatment effect from the ith study, yi, is distributed as \({y_i}|{\mu _i}\sim N({\mu _i},\sigma _i^2)\), where µi is the true underlying treatment effect of the ith study and \(\sigma _i^2\) is the corresponding within-study variance. This variance is estimated by \(\hat \sigma _i^2\), the squared entry of column 9 of the Table. The random-effects model further assumes that \({\mu _i}\sim N(\mu ,{\tau ^2})\), where µ and τ2 denote the overall treatment effect and the between-study variance, respectively. These two distributions imply the marginal distribution \({y_i}\sim N(\mu ,\sigma _i^2 + {\tau ^2})\). 
Random-effects procedures for a meta-analysis differ by how the between-study variance τ2 gets estimated. The DerSimonian and Laird8 estimate, \(\hat \tau _{DL}^2\), is commonly used. DerSimonian and Laird use the Q statistic, \(Q = \sum\nolimits_{i = 1}^n {\sigma _i^{ - 2}} {({y_i} - \bar y)^2}\), where \(\bar y = {{\sum\nolimits_{i = 1}^n {\sigma _i^{ - 2}} {y_i}} \big/ {\sum\nolimits_{i = 1}^n {\sigma _i^{ - 2}} }} = \sum\nolimits_{i = 1}^n {{w_i}} {y_i}\) is the pooled estimate under the fixed-effects model. Under the assumptions of the random-effects model, the expectation of Q is \(E(Q) = (n - 1) + ({S_1} - \frac{{{S_2}}}{{{S_1}}}){\tau ^2}\), where \({S_1} = \sum\nolimits_{i = 1}^n {\sigma _i^{ - 2}} \)and \({S_2} = \sum\nolimits_{i = 1}^n {\sigma _i^{ - 4}} \). Replacing all unknown variances \(\sigma _i^2\) with their estimates \(\hat \sigma _i^2\) and solving the equation Q = E(Q) for τ2 leads to the DerSimonian and Laird estimate \(\hat \tau _{DL}^2 = \max (0,\frac{{Q - (n - 1)}} {{{S_1} - \frac{{{S_2}}}{{{S_1}}}}})\). With this estimate, the generalized least squares estimate of the common treatment effect µ is given by the weighted average  
\begin{eqnarray*}{\textit{effect}_{DL}} &=& \frac{{\sum\nolimits_{i = 1}^n {{{(\hat \sigma _i^2 + \hat \tau _{DL}^2)}^{ - 1}}{y_i}} }}{{\sum\nolimits_{i = 1}^n {{{(\hat \sigma _i^2 + \hat \tau _{DL}^2)}^{ - 1}}} }} = \sum\nolimits_{i = 1}^n {{{\tilde w}_i}} {y_i},\nonumber\\ \hbox{with weights }{\tilde w_i} &=& \frac{{{{(\hat \sigma _i^2 + \hat \tau _{DL}^2)}^{ - 1}}}}{{\sum\nolimits_{i = 1}^n {{{(\hat \sigma _i^2 + \hat \tau _{DL}^2)}^{ - 1}}} }}.\end{eqnarray*}
 
The standard error of this estimate is  
\begin{equation*}se({\textit{effect}_{DL}}) = \sqrt {\frac{1}{{\sum\nolimits_{i = 1}^n {{{(\hat \sigma _i^2 + \hat \tau _{DL}^2)}^{ - 1}}} }}} .\end{equation*}
 
For the example in the Table, \(\hat \tau _{DL}^2 = 7.21\). The standard deviation \({\hat \tau _{DL}} = \sqrt {7.21} = 2.69\) microns reflects the between-study variability. The estimate of the overall treatment effect is effectDL = −7.41 microns, with a standard error se(effectDL) = 0.805 microns. The approximate 95% confidence interval −7.41± (1.96)(0.805) extends from −8.98 to −5.83, and the probability value for testing the hypothesis whether or not the common effect is zero is less than 0.0001. 
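The DerSimonian and Laird procedure can be implemented directly from the formulas above. The sketch below is in Python for illustration (the published analysis used the R library meta), and the three-study input in the usage note is hypothetical:

```python
import math

def dersimonian_laird(effects, ses):
    """Random-effects pooled estimate using the DerSimonian-Laird tau^2."""
    w = [1.0 / se**2 for se in ses]                      # fixed-effects weights
    s1, s2 = sum(w), sum(wi**2 for wi in w)
    ybar = sum(wi * yi for wi, yi in zip(w, effects)) / s1
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, effects))
    n = len(effects)
    # Solve Q = E(Q) for tau^2, truncated at zero.
    tau2 = max(0.0, (q - (n - 1)) / (s1 - s2 / s1))
    # Re-weight by 1 / (sigma_i^2 + tau^2) and pool.
    wt = [1.0 / (se**2 + tau2) for se in ses]
    pooled = sum(wi * yi for wi, yi in zip(wt, effects)) / sum(wt)
    return tau2, pooled, math.sqrt(1.0 / sum(wt))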
The random-effects model allows for variability among the study effects, whereas in the fixed-effects model all study effects are assumed to originate from a single common mean. In our example, the two pooled estimates, effectpooled = −7.70 and effectDL = −7.41, are about the same. However, because of the large between-study variability, the standard error from the random-effects model, se(effectDL) = 0.805, is twice as large as the standard error from the fixed-effects model, se(effectpooled) = 0.378. Nevertheless, the conclusions are unchanged; either method of meta-analysis confirms highly significant differences in the mean retinal thickness of healthy controls and patients with MS. 
Furthermore, the magnitudes of the estimated effects are large. For each study we pool the standard deviations of the MS and healthy control groups in columns 4 and 7 of the Table, \({s_{pooled}} = \sqrt {[({n_{MS}} - 1)s_{MS}^2 + ({n_{HC}} - 1)s_{HC}^2]/({n_{MS}} + {n_{HC}} - 2)} \). We then average the standard deviation estimates across the 18 studies, using the fixed effects weights in column 15 of the Table as an indication of their reliability. The resulting estimate, s = 10.63 microns, is insensitive to the weights being used (the equally weighted average is 10.87; the weighted average with weights from the random-effects is 10.53). The standard deviation s = 10.63 leads to a Cohen's d in the medium to large category (7.70/10.63 = 0.72 and 7.41/10.63 = 0.70). 
Software
The R library meta9 can be used to carry out the analysis and visualize its results. The forest plot in the Figure displays the estimated study effects as squares and their confidence intervals as horizontal lines. The area of each square reflects the precision of the treatment estimate. The vertical line through zero represents the no-effect hypothesis; confidence intervals of individual studies that overlap this line indicate effects that do not differ significantly from zero. The weights for the fixed- and random-effects meta-analyses are shown, and the resulting pooled estimates are visualized by vertical lines and labeled diamonds. In this example, the differences between the two estimates, and hence between the two vertical lines, are minor. The name forest plot refers to the forest of lines it produces. GraphPad Prism 8 also can be used to produce forest plots. 
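The R library meta produces publication-quality forest plots; purely as a didactic sketch of the display's logic, the Python function below draws a crude text version. All names, the axis range, and the study values in the usage note are hypothetical.

```python
def text_forest(labels, effects, ses, width=40, lo=-15.0, hi=5.0):
    """Crude text forest plot: each study's 95% CI as dashes, its estimate
    as 'o', and the no-effect line at zero as '|'."""
    def col(x):  # map an effect value to a character column
        return min(width - 1, max(0, int((x - lo) / (hi - lo) * (width - 1))))
    rows = []
    for lab, y, se in zip(labels, effects, ses):
        row = [" "] * width
        for c in range(col(y - 1.96 * se), col(y + 1.96 * se) + 1):
            row[c] = "-"                 # confidence interval
        row[col(y)] = "o"                # point estimate
        row[col(0.0)] = "|"              # vertical no-effect line
        rows.append(f"{lab:>8} {''.join(row)}")
    return "\n".join(rows)
```

A call such as `text_forest(["Study 1", "Study 2"], [-8.0, -6.0], [1.0, 2.0])` prints one row per study; intervals whose dashes reach the `|` column are the ones whose effects do not differ significantly from zero.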
Figure.
 
Results of the meta-analysis of the data from the Table. CI, confidence interval.
Summary
Rigor and reproducibility are now emphasized in the design and analysis of experiments to help ensure that results hold up to the test of time and can be replicated in other studies. We have outlined factors that can compromise the reproducibility of published results that have been interpreted as significant. We have also emphasized the importance of characterizing the effect size, Cohen's d, which relates the difference of means to the standard deviation of the results. The major emphasis in this tutorial is on how best to combine the results of various studies through a meta-analysis, to determine whether an overall effect is significant and consistent across similar studies; this helps to bolster the rigor and repeatability of conclusions. 
Acknowledgments
Supported by the Center for the Prevention and Treatment of Visual Loss, Iowa City VA Health Care Center (RR&D C9251-C; RX003002), and an endowment from the Pomerantz Family Chair in Ophthalmology (RK). 
Disclosure: J. Ledolter, None; R.H. Kardon, None 
References
Ioannidis JPA . Contradicted and initially stronger effects in highly cited clinical research. JAMA. 2005; 294: 218–228. [CrossRef] [PubMed]
Ioannidis JPA . Why most published research findings are false. PLoS Med. 2005; 2: e124. [CrossRef] [PubMed]
Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York: Routledge; 1988.
Wasserstein RL, Schirm AL, Lazar NA. Moving to a world beyond “p  < 0.05.” Editorial. Am Stat. 2019; 73: 1–19. [CrossRef]
Petzold A, Balcer LJ, Calabresi PA, et al. Retinal layer segmentation in multiple sclerosis: a systematic review and meta-analysis. Lancet Neurol. 2017; 16: 797–812. [CrossRef] [PubMed]
Ledolter J, Gramlich OW, Kardon RH. Focus on data: parametric statistical inference. Invest Ophthalmol Vis Sci. 2020; 61: 11. [CrossRef] [PubMed]
Abraham B, Ledolter J. Introduction to Regression Modeling. Belmont, CA: Duxbury Press; 2006:128.
DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986; 7: 177–188. [CrossRef]
R Software for Statistical Computing. Library meta. Vienna, Austria: R Foundation for Statistical Computing. Available at: www.R-project.org/.