**Purpose**:
The purpose of this tutorial is to provide visual scientists with various approaches for comparing two or more groups of data using parametric statistical tests, which require that the distribution of data within each group is normal (Gaussian). Non-parametric tests are used for inference when the sample data are not normally distributed or the sample is too small to assess its true distribution.

**Methods**:
Methods are reviewed using retinal thickness, as measured by optical coherence tomography (OCT), as an example for comparing two or more group means. The following parametric statistical approaches are presented for different situations: two-sample t-test, Analysis of Variance (ANOVA), paired t-test, and the analysis of repeated measures data using a linear mixed-effects model approach.

**Results**:
Analyzing differences between means using various approaches is demonstrated, and follow-up procedures to analyze pairwise differences between means when there are more than two comparison groups are discussed. The assumption of equal variance between groups and methods to test for equal variances are examined. Examples of repeated measures analysis for right and left eyes on subjects, across spatial segments within the same eye (e.g. quadrants of each retina), and over time are given.

**Conclusions**:
This tutorial outlines parametric inference tests for comparing means of two or more groups and discusses how to interpret the output from statistical software packages. Critical assumptions made by the tests and ways of checking these assumptions are discussed. Efficient study designs increase the likelihood of detecting differences between groups if such differences exist. Situations commonly encountered by vision scientists involve repeated measures from the same subject over time, measurements on both right and left eyes from the same subject, and measurements from different locations within the same eye. Repeated measurements are usually correlated, and the statistical analysis needs to account for the correlation. Doing this the right way helps to ensure rigor so that the results can be repeated and validated.

^{1} and they are summarized here. There are three treatment groups: control mice, diseased mice (EAE) with optic neuritis, and treated diseased mice (EAE + treatment). For the purpose of this tutorial, we consider only mice with measurements made on both eyes. This leaves us with 15, 12, and 6 subjects (mice) in the three groups, respectively. For the various statistical analyses in this tutorial, the variance (*s*^{2}) is defined as the sum of the squared differences of each sample from their sample mean, divided by the number of samples minus 1 (subtracting 1 corrects for the sample bias). The standard deviation is the square root of the variance. The software programs Prism 8 (GraphPad, San Diego, CA, USA) and Minitab (State College, PA, USA) were used to generate the graphs shown in this tutorial.
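As a quick numerical check of these definitions, the following sketch (with made-up numbers, not the study data) computes the sample variance and standard deviation by hand and confirms that NumPy's `ddof=1` option applies the same *n* − 1 divisor:

```python
import numpy as np

# Hypothetical sample of thickness measurements (not the study data)
y = np.array([61.2, 58.7, 63.1, 59.9, 60.4])

n = y.size
# Sample variance: sum of squared deviations from the mean, divided by n - 1
s2 = np.sum((y - y.mean()) ** 2) / (n - 1)
s = np.sqrt(s2)  # standard deviation

# NumPy's ddof=1 applies the same n - 1 correction
assert np.isclose(s2, y.var(ddof=1))
assert np.isclose(s, y.std(ddof=1))
```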

Two-Sample *t*-Test

The two-sample *t*-test relates the difference of the sample means \({\bar y_A} - {\bar y_B}\) to its estimated standard error, \(s{e_{{{\bar y}_A} - {{\bar y}_B}}} = \scriptstyle{\sqrt {\frac{{s_A^2}}{{{n_A}}} + \frac{{s_B^2}}{{{n_B}}}}} \). Here, \({n_A},{\bar y_A},{s_A}\) and \({n_B},{\bar y_B},{s_B}\) are the sample size, mean, and standard deviation for each of the two groups. Under the null hypothesis of equal means, the test statistic follows a *t*-distribution, with its degrees of freedom \(\scriptstyle{\frac{{{{[(s_A^2/{n_A}) + (s_B^2/{n_B})]}^2}}}{{(s_A^4/n_A^2({n_A} - 1)) + (s_B^4/n_B^2({n_B} - 1))}}}\) given by the Welch approximation.^{2} Confidence intervals and probability values can be calculated. Small probability values (smaller than 0.05 or 0.10) indicate that the null hypothesis of no difference between the means can be rejected. Note that, although traditionally a probability of <0.05 has been considered significant, some groups favor an even more stringent criterion, whereas others feel that a less conservative criterion (e.g., *P* < 0.1) may still be meaningful, depending on the context of the study.
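A sketch of the Welch test in Python, using hypothetical thickness values (none of these numbers come from the study); `scipy.stats.ttest_ind` with `equal_var=False` implements the same calculation:

```python
import numpy as np
from scipy import stats

# Hypothetical thickness measurements for two groups (not the study data)
yA = np.array([62.1, 60.4, 63.8, 61.5, 59.9, 62.7])
yB = np.array([55.2, 57.1, 54.8, 56.5, 55.9])

nA, nB = yA.size, yB.size
sA2, sB2 = yA.var(ddof=1), yB.var(ddof=1)

# Standard error of the difference of means (variances not pooled)
se = np.sqrt(sA2 / nA + sB2 / nB)
t = (yA.mean() - yB.mean()) / se

# Welch approximation of the degrees of freedom
df = (sA2 / nA + sB2 / nB) ** 2 / (
    (sA2 / nA) ** 2 / (nA - 1) + (sB2 / nB) ** 2 / (nB - 1)
)
p = 2 * stats.t.sf(abs(t), df)  # two-sided probability value

# scipy's Welch t-test (equal_var=False) gives the same statistic
t_sp, p_sp = stats.ttest_ind(yA, yB, equal_var=False)
assert np.isclose(t, t_sp) and np.isclose(p, p_sp)
```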

An alternative version of the test pools the two sample variances; its test statistic has a *t*-distribution with *n*_{A} + *n*_{B} − 2 degrees of freedom. However, we prefer the first method, which calculates the standard error of each group separately (not pooled) and uses the Welch approximation for the degrees of freedom, as it does not require that the two group variances be the same. The pooled version of the test assumes equal variances and can be misleading when they are not.^{3} Both *t*-tests are robust to non-normality as long as the sample sizes are reasonably large (sample sizes of 30 or larger; robustness follows from the central limit effect).

The small *P* value (0.0001) shows that this difference is quite significant, leaving little doubt that the disease leads to thinning of the inner retinal layer (Table 1).

Consider *k* groups (for our illustration, *k* = 3) with observations *y*_{ij} for *i* = 1, 2, …, *k* and *j* = 1, 2, …, *n*_{i} (the number of observations in the *i*th group). The ANOVA table partitions the sum of squared deviations of the \(n = \sum\nolimits_{i = 1}^k {{n_i}} \) observations from their overall mean, \(\bar y\), into two components: the between-group (or treatment) sum of squares, \(SSB = \sum\nolimits_{i = 1}^k {{n_i}({{\bar y}_i}} - \bar y{)^2}\), expressing the variability of the group means \({\bar y_i}\) from the overall mean \(\bar y\), and the within-group (or residual) sum of squares, \(SSW = \sum\nolimits_{i = 1}^k {\{ {\sum\nolimits_{j = 1}^{{n_{i}}} {({y_{ij}} - {{\bar y}_i}} {)^2}} \}} = \sum\nolimits_{i = 1}^k {({n_i} - 1)s_i^2} \), which adds up all within-group variances, \(s_i^2\). The ratio of the resulting mean squares (where mean squares are obtained by dividing sums of squares by their degrees of freedom), \(F = \frac{{SSB/(k - 1)}}{{SSW/(n - k)}}\), serves as the statistic for testing the null hypothesis that all group means are equal. The probability value for testing this hypothesis can be obtained from the *F*-distribution. Small probability values (smaller than 0.05 or 0.10) indicate that the null hypothesis should be rejected.
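The sum-of-squares partition can be verified numerically; the sketch below, with hypothetical data, computes SSB, SSW, and *F* by hand and reproduces `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

# Hypothetical data for k = 3 groups (not the study data)
groups = [
    np.array([62.1, 60.4, 63.8, 61.5, 59.9]),
    np.array([55.2, 57.1, 54.8, 56.5]),
    np.array([58.0, 59.3, 57.6]),
]

k = len(groups)
n = sum(g.size for g in groups)
ybar = np.concatenate(groups).mean()  # overall mean

# Between-group sum of squares: variability of group means around ybar
ssb = sum(g.size * (g.mean() - ybar) ** 2 for g in groups)
# Within-group sum of squares: pooled within-group variability
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (ssb / (k - 1)) / (ssw / (n - k))
p = stats.f.sf(F, k - 1, n - k)  # upper tail of the F-distribution

# scipy's one-way ANOVA gives the same F and P
F_sp, p_sp = stats.f_oneway(*groups)
assert np.isclose(F, F_sp) and np.isclose(p, p_sp)
```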

It has been shown^{3} that the *F*-test is sensitive to violations of the equal variance assumption, especially if the sample sizes in the groups are different. The *F*-test is less affected by unequal variances if the sample sizes are equal. Although the *F*-test assumes normality, it is robust to non-normality as long as the sample sizes are reasonably large (e.g., 30 samples per group). For *k* = 2 groups, the ANOVA *F*-test is equivalent to the two-sample *t*-test that uses the pooled variance. Earlier we recommended the Welch approximation, which uses a different standard error calculation for the difference of two sample means, as it does not assume equal variances. Useful tests for the equality of variances are discussed later.

The Tukey multiple comparison procedure^{4} adjusts for the number of pairwise comparisons (Table 2, Fig. 1). Many other multiple comparison procedures are available (Bonferroni, Scheffé, Šidák, Holm, Dunnett, Benjamini–Hochberg), but their discussion would go beyond this introduction. For a discussion of the general statistical theory of multiple comparisons, see Hsu.^{5}

The ANOVA indicates significant differences among the group means (*P* = 0.0001). Tukey pairwise comparisons show differences between the group means of thickness for control and EAE and for control and EAE + treatment. The means of EAE and EAE + treatment are not significantly different.
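As a minimal illustration of the multiple-comparison idea, the sketch below applies the Bonferroni adjustment (one of the procedures listed above) to pairwise Welch *t*-tests; the group labels mirror the example, but all numbers are invented:

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical data for the three groups (not the study data)
data = {
    "control": np.array([62.1, 60.4, 63.8, 61.5, 59.9]),
    "EAE": np.array([55.2, 57.1, 54.8, 56.5]),
    "EAE+treatment": np.array([58.0, 59.3, 57.6, 56.2]),
}

pairs = list(combinations(data, 2))
m = len(pairs)  # number of pairwise comparisons (here 3)

adjusted = {}
for a, b in pairs:
    t, p = stats.ttest_ind(data[a], data[b], equal_var=False)
    adjusted[(a, b)] = min(1.0, m * p)  # Bonferroni-adjusted P value

for (a, b), p_adj in adjusted.items():
    print(f"{a} vs {b}: adjusted P = {p_adj:.4f}")
```

The Bonferroni adjustment simply multiplies each raw *P* value by the number of comparisons; it is more conservative than Tukey's procedure but easy to compute by hand.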

The Bartlett test^{6} (see Snedecor and Cochran^{7}) is employed for testing whether two or more samples are from populations with equal variances. Equal variances across populations are referred to as homoscedasticity or homogeneity of variances. The Bartlett test compares each group variance with the pooled variance and is sensitive to departures from normality. The tests by Levene^{8} and Brown and Forsythe^{9} are good alternatives that are less sensitive to departures from normality. These tests make use of the results of a one-way ANOVA on the absolute value of the difference between measurements and their respective group mean (for the Levene test) or their group median (for the Brown–Forsythe test).

Consider how decisive the *P* value was in rejecting the null hypothesis of equal variance. If a fair amount of uncertainty remains, the alternative approaches discussed in the next section can be used.

Box and Cox^{10} discussed transformations that stabilize the variability so that the variances in the groups are the same. A logarithmic transformation is indicated when the standard deviation in a group is proportional to the group mean; a square root transformation is indicated when the variance is proportional to the mean. Reciprocal transformations are useful if one studies the time from the onset of a disease (or of a treatment) to a certain failure event such as death or blindness. The reciprocal of time to death, which expresses the rate of dying, often stabilizes group variances. For details, see Box et al.^{11}
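A small simulation illustrates the first rule: when the standard deviation is proportional to the mean (constant coefficient of variation, as for lognormal data), a log transformation equalizes the group standard deviations. All numbers below are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two simulated groups whose standard deviation is proportional to their
# mean (constant coefficient of variation), as holds for lognormal data
low = rng.lognormal(mean=2.0, sigma=0.4, size=2000)
high = rng.lognormal(mean=4.0, sigma=0.4, size=2000)

# On the raw scale, the group with the larger mean has the larger SD ...
assert high.std(ddof=1) > 3 * low.std(ddof=1)

# ... but after a log transformation the group SDs are nearly equal
sd_low = np.log(low).std(ddof=1)
sd_high = np.log(high).std(ddof=1)
assert abs(sd_low - sd_high) < 0.05
```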

Paired *t*-Test

The paired *t*-test considers treatment differences, *d*, on *n* different subjects and compares the sample mean (\(\bar d\)) to its standard error, \(s{e_{\bar d}} = {s_d}/\sqrt n \). Under the null hypothesis of no difference, the ratio (test statistic) \(\bar d/s{e_{\bar d}}\) has a *t*-distribution with *n* − 1 degrees of freedom, and confidence intervals and probability values can be calculated. Small probability values (usually smaller than 0.05 or 0.10) indicate that the null hypothesis should be rejected.
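A sketch of the paired test with hypothetical before/after measurements (not the study data); `scipy.stats.ttest_rel` implements the same calculation:

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements on n subjects (not the study data)
before = np.array([61.2, 58.7, 63.1, 59.9, 60.4, 62.3])
after = np.array([58.9, 57.5, 60.8, 58.1, 59.6, 60.0])

d = before - after                 # within-subject differences
n = d.size
se = d.std(ddof=1) / np.sqrt(n)    # standard error of the mean difference
t = d.mean() / se
p = 2 * stats.t.sf(abs(t), n - 1)  # t-distribution with n - 1 df

# scipy's paired t-test gives the same result
t_sp, p_sp = stats.ttest_rel(before, after)
assert np.isclose(t, t_sp) and np.isclose(p, p_sp)
```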

The *t*-test in Table 1 and the ANOVA in Table 2 used subject averages of the thickness of the right and left eyes. Switching to eyes as the unit of observation, it is tempting to run the same tests with twice the number of observations in each group, as each subject now provides two observations. But, if eyes on the same subject are correlated (in our illustration with 33 subjects, the correlation between OD and OS retinal thickness is very large: *r* = 0.90), this amounts to “cheating,” as correlated observations carry less information than independent ones. By artificially inflating the number of observations and inappropriately reducing standard errors, the probability values appear more significant than they actually are. Ignoring the correlation inflates the *F*-test statistic. This shows that a strategy of adding more and more perfect replicates to each observation makes even the smallest difference appear significant. One cannot ignore the correlation among measurements on the same subject! The following two sections show how this correlation can be incorporated into the analysis.
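The inflation can be demonstrated by simulation: duplicating every observation (a "perfect replicate" of each eye) adds no information, yet it mechanically increases the *t*-statistic and shrinks the *P* value. All numbers below are simulated, not study data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated subject means for two groups (hypothetical group sizes 15 and 12)
a = rng.normal(60, 3, size=15)
b = rng.normal(57, 3, size=12)

t1, p1 = stats.ttest_ind(a, b, equal_var=False)

# "Perfect replicates": count each subject twice, as if the two eyes were
# independent observations with identical values
a2, b2 = np.repeat(a, 2), np.repeat(b, 2)
t2, p2 = stats.ttest_ind(a2, b2, equal_var=False)

# No information was added, yet the t-statistic grows (by roughly sqrt(2))
# and the P value shrinks -- an artifact of ignoring the correlation
assert abs(t2) > abs(t1) and p2 < p1
```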

- α is an intercept.
- β_{j} in this example represents (three) fixed differential treatment effects, with β_{1} + β_{2} + β_{3} = 0. With this restriction, treatment effects are expressed as deviations from the average. An equivalent representation sets one of the three coefficients equal to zero; then the parameter of each included group represents the difference between the averages of the included group and the reference group whose parameter has been omitted.
- π_{i(j)} represents random subject effects, represented by a normal distribution with mean 0 and variance \(\sigma _\pi ^2\). The subscript notation *i*(*j*) expresses the fact that subject *i* is nested within factor *j*; that is, subject 1 in treatment group 1 is a different subject than subject 1 in treatment group 2. Each subject is observed under only a single treatment group. This is different from the “crossed” design, where each subject is studied under all treatment groups.
- γ_{k} represents fixed eye (OD, OS) effects with coefficients adding to zero: γ_{1} + γ_{2} = 0.
- βγ_{jk} represents the interaction effects between the two fixed effects, treatment and eye, with row and column sums of the array βγ_{jk} restricted to zero. There is no interaction when all βγ_{jk} are zero; this makes effects easier to interpret, as the effects of one factor do not depend on the level of the other.
- ε_{i(j)k} represents random measurement errors, with a normal distribution, mean = 0, and variance = \(\sigma _\varepsilon ^2\). Measurement errors reflect the eye-by-subject (within treatment) interaction.
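A simulation from this model (with made-up parameter values) shows how the shared subject effect π_{i(j)} induces the correlation between fellow eyes that the analysis must respect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from the model above (all parameter values hypothetical):
# y_i(j)k = alpha + beta_j + pi_i(j) + gamma_k + eps_i(j)k
alpha = 60.0
beta = np.array([3.0, -2.0, -1.0])  # fixed treatment effects, summing to zero
gamma = np.array([0.2, -0.2])       # fixed eye (OD, OS) effects, summing to zero
sigma_pi, sigma_eps = 4.0, 1.5      # subject and measurement-error SDs

n_subj = 2000  # subjects per treatment group (large, to expose the correlation)
od_c, os_c = [], []
for j in range(3):
    pi = rng.normal(0, sigma_pi, n_subj)  # random subject effects, nested in j
    od = alpha + beta[j] + pi + gamma[0] + rng.normal(0, sigma_eps, n_subj)
    os_ = alpha + beta[j] + pi + gamma[1] + rng.normal(0, sigma_eps, n_subj)
    od_c.append(od - od.mean())           # center within treatment group
    os_c.append(os_ - os_.mean())

od_c, os_c = np.concatenate(od_c), np.concatenate(os_c)

# The shared subject effect induces a within-subject correlation between
# eyes of corr = sigma_pi^2 / (sigma_pi^2 + sigma_eps^2)
expected = sigma_pi**2 / (sigma_pi**2 + sigma_eps**2)
observed = np.corrcoef(od_c, os_c)[0, 1]
assert abs(observed - expected) < 0.03
```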

For details on linear mixed-effects models, see Diggle et al.^{12} and McCulloch et al.^{13}

^{14} These variabilities are estimated by the two mean square (MS) errors shown in boldface type in Table 5.

*F*(Treatment) = 287.2/23.45 = 12.25, and the treatment effect is significant at *P* = 0.0001. MS(Residual) = 2.192 is used in the test for subject effects and in tests of the main effect of eye and the eye × treatment interaction: *F*(Subject) = 23.45/2.192 = 10.70 (significant; *P* < 0.0001); *F*(Eye) = 1.351/2.192 = 0.6164 (not significant; *P* = 0.4385); and *F*(Eye × Treatment) = 0.298/2.192 = 0.1360 (not significant; *P* = 0.8734). In summary, the mean retinal thickness differs among the control, EAE, and EAE + treatment groups. Thickness varies widely among subjects, but differences in means between right and left eyes are not significant.
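The *F* ratios above are simple quotients of the quoted mean squares and can be checked by hand:

```python
import numpy as np

# Mean squares quoted in the text (from Table 5)
ms_treatment, ms_subject = 287.2, 23.45
ms_eye, ms_eye_trt, ms_residual = 1.351, 0.298, 2.192

# Treatment is tested against the subject mean square ...
F_treatment = ms_treatment / ms_subject
# ... while subject, eye, and eye x treatment are tested against the residual
F_subject = ms_subject / ms_residual
F_eye = ms_eye / ms_residual
F_eye_trt = ms_eye_trt / ms_residual

assert np.isclose(F_treatment, 12.25, atol=0.01)
assert np.isclose(F_subject, 10.70, atol=0.01)
assert np.isclose(F_eye, 0.6164, atol=0.001)
assert np.isclose(F_eye_trt, 0.1360, atol=0.001)
```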

Ignoring the random subject effect increases the *F*-statistic to *F* = 287.2/12.23 = 23.47, which is highly significant. However, such an incorrect analysis, which does not account for the high correlation between measurements on right and left eyes, leads to wrong probability values and wrong conclusions. It makes the treatment effect appear even more significant than it really is. In this example, the conclusions about the factors are not changed, but that is not true in general.

*F*(Treatment) = 1148.93/93.80 = 12.25. MS(Residual) = 12.66 is used in all other tests (subject effects, main and interaction effects of eye and quadrant, and all of their interactions with treatment). Treatment and subject effects are highly significant, but none of the effects involving eye or quadrant is significant, meaning that eye and quadrant had no detectable effect on retinal thickness.

**J. Ledolter**, None; **O.W. Gramlich**, None; **R.H. Kardon**, None

*Invest Ophthalmol Vis Sci*. 2020; 0: 30023, https://doi.org/10.1167/iovs.0.0.30023.

*Biometrika*. 1947; 34: 28–35. [PubMed]

*Ann Math Stat*. 1954; 25: 290–302. [CrossRef]

*Exploratory Data Analysis*. Reading, MA: Addison-Wesley; 1977.

*Multiple Comparisons: Theory and Methods*. London: Chapman & Hall; 1996.

*Proc R Soc Lond A Math Phys Sci*. 1937; 160: 268–282.

*Statistical Methods*. 8th ed. Ames, IA: Iowa State University Press; 1989.

*Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling*. Palo Alto, CA: Stanford University Press; 1960: 278–292.

*J Am Stat Assoc*. 1974; 69: 364–367. [CrossRef]

*J R Stat Soc B Methodol*. 1964; 26: 211–243.

*Statistics for Experimenters: Design, Innovation, and Discovery*. 2nd ed. New York: John Wiley & Sons; 2005.

*Analysis of Longitudinal Data*. New York: Oxford University Press; 1994.

*Generalized, Linear, and Mixed Models*. 2nd ed. New York: John Wiley & Sons; 2008.

*Statistical Principles of Experimental Design*. 2nd ed. New York: McGraw-Hill; 1999.