**Purpose**:
To provide information to visual scientists on how to optimally design experiments and how to select an appropriate sample size, which is often referred to as a power analysis.

**Methods**:
Statistical guidelines are provided outlining good principles of experimental design, including replication, randomization, blocking or grouping of subjects, multifactorial design, and sequential approach to experimentation. In addition, principles of power analysis for calculating required sample size are outlined for different experimental designs and examples are given for calculating power and factors influencing it.

**Results**:
The interaction between power, sample size, and standardized effect size is shown. The following results are also provided: sample size increases with power, sample size increases with decreasing detectable difference, sample size increases proportionally to the variance, and two-sided tests, without preference as to whether the mean increases or decreases, require a larger sample size than one-sided tests.

**Conclusions**:
This review outlines principles for good experimental design and methods for power analysis for the typical sample size calculations that visual scientists encounter when designing experiments with normally and non-normally distributed measurements.

^{1} Ledolter and Swersey,^{2} and Montgomery.^{3} The seminal contributions of Fisher^{4} have shaped this field. The following are important statistical design principles:

- •Replication. Observing a certain result just once or twice does not make it reliable. Natural variability is present everywhere; results from repeated trials on the same subject vary, and results from trials on different subjects vary even more. Replicating the experiment increases the reliability and rigor of the results.
- •Randomization. Allocation of treatments randomly to experimental subjects ensures the validity of an inference in the presence of unspecified disturbances by making certain that the risk of such disturbances is spread evenly among the treatment groups. For example, in visual science, this might entail assigning different treatments to groups of patients at random, or randomizing the order of an assigned treatment to a group of subjects or eyes. Without randomization, treatment differences may be confounded with other variables that are not controlled by the experimenter. Anticipating confounding variables in advance can help tremendously in randomizing subjects based on these variables, such as age, sex, animals from the same litter or cage, severity of disease, or level at baseline.
- •Blocking. Randomizing the assignment of treatments to subjects or eyes is important because it spreads the existing variability among subjects equitably across all treatments. However, the experimenter can do considerably better if the experimental subjects can be grouped into blocks, such that units are homogeneous within a block but differ across blocks. In the visual sciences, eyes can be blocked by subject or, in the case of mice, by cage or litter. Responses of eyes from different subjects vary considerably, whereas responses of eyes from the same subject are usually correlated and vary much less. When studying treatment effects, one frequently treats only one eye while keeping the other eye as a within-subject control. This approach assumes that the treatment affects only the eye that receives it, which may not be the case in every situation. If the effect is restricted to the treated eye, then the large subject effect that affects both eyes in a similar way can be removed, which increases the precision of the comparison and makes it more sensitive to detecting an effect, if one exists. It is also more efficient to design the experiment such that each treatment is applied to the same subject (or eye) at different time points, spaced so that the effect of the first treatment is no longer present and does not affect the response to the second. The order of the treatments can be randomized to ensure that treatment effects are not compromised by the order. A within-subject comparison of the effectiveness of a treatment or a drug is subject to fewer interfering variables than a comparison across different subjects.
- •Multifactor design. A multifactor design should be considered instead of a one-factor-at-a-time experimental approach. A common, but inefficient, approach to studying the effects of several factors is to carry out successive experiments in which the levels of each factor are changed one at a time. Fisher^{4} showed that a better approach is to vary the factors simultaneously and to study the response at each possible factor-level combination. Such an approach makes it possible to learn about interaction effects (e.g., whether the effect on the response when changing one factor depends on the level of another factor).
- •Sequential approach to experimentation. Each experiment contributes to one's understanding. The results of one experiment are critical in determining the next experimental steps. Hence only a portion of the overall research plan and budget should be spent on the initial experiment.
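The randomization and blocking principles above can be combined in a few lines of code. The following is a minimal sketch, assuming each subject forms one block and exactly one eye per subject is treated while the fellow eye serves as the control; the function name `assign_treated_eye` and the subject labels are hypothetical and chosen only for illustration.

```python
import random


def assign_treated_eye(subject_ids, seed=None):
    """Randomly pick which eye of each subject receives the treatment.

    Each subject is one block: one eye is treated and the fellow eye
    is kept as the within-subject control.
    """
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    return {sid: rng.choice(["left", "right"]) for sid in subject_ids}


# Hypothetical subject labels, for illustration only.
allocation = assign_treated_eye(["mouse-01", "mouse-02", "mouse-03", "mouse-04"], seed=2020)
for subject, eye in allocation.items():
    print(subject, "-> treat", eye, "eye")
```

Recording the seed together with the allocation lets the assignment be audited later; the same idea extends to randomizing the order of treatments within a subject.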

^{4}page 217).

^{5} Lenth's sample size applets (free, good, and easy to use) cover many different situations, including continuous outcome variables (with an emphasis on means and variances), categorical outcome variables (with an emphasis on proportions), and correlations. G*Power, developed by Faul et al.,^{6,7} is another free software program available for both Macintosh and PC platforms.

^{8}

*H*_{0}: μ = μ_{0} against the one-sided (lower-tailed) alternative hypothesis *H*_{1}: μ < μ_{0}. We test the research hypothesis that an intervention reduces the mean from its current known value μ_{0}. When determining the appropriate sample size, we need to specify values for the following four items:

- • \(\sigma = \sqrt{\mathrm{Var}(Y)}\), the **SD** (standard deviation = square root of the variance) of the normally distributed measurement variable *Y*. Prior data in the literature or pilot data provide a planning value for the SD.
- • The **significance level** (that is, the probability of falsely rejecting a true null hypothesis); usually α = 0.05.
- • The **power** (usually 0.80, or 80%) to detect a specified **meaningful change** (commonly referred to as **effect size**) δ = μ_{1} − μ_{0} < 0. β = 1 − *Power* (here β = 1 − 0.8 = 0.2) is the probability of a type II error, that is, of accepting the null hypothesis *H*_{0} when the mean has indeed shifted to μ_{1} = μ_{0} + δ < μ_{0}.

**Result:** The required sample size for detecting a change δ with power 1 − β is

\(n = (z_{\alpha} + z_{\beta})^2 (\sigma/\delta)^2\).

*z*_{α} and *z*_{β} are percentiles (z-scores) of the standard normal distribution; they can be looked up in normal probability tables. For a 5% significance level, *z*_{α=0.05} = −1.645; for 80% power and a type II error of 0.20, *z*_{β=0.20} = −0.8416.

*R*^{2} = (δ/σ)^{2}. The ratio *R* = |δ|/σ expresses the size of the detectable meaningful change as a fraction of the SD; we refer to it as the standardized effect size. Figure 1 plots the sample size against *R*, for 5% significance and three different values of power (70%, 80%, and 90%). For a given *R*, one can find graphically the sample size that is required to detect that change. Approximately 25 observations are needed to detect a change of half an SD with 80% power; fewer (19) observations are needed for 70% power, and more (35) observations are needed for 90% power.
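The sample sizes quoted for a change of half an SD can be checked numerically. The sketch below assumes the one-sided formula *n* = (*z*_{α} + *z*_{β})^{2}(σ/δ)^{2} used in the worked examples of this section; the function name `one_sample_n` is hypothetical, and the percentiles come from Python's standard library rather than from printed tables.

```python
from math import ceil
from statistics import NormalDist


def one_sample_n(sigma, delta, alpha=0.05, power=0.80):
    """Unrounded sample size for a one-sided, one-sample test of the mean."""
    z_alpha = NormalDist().inv_cdf(alpha)      # about -1.645 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(1 - power)   # about -0.8416 for power = 0.80
    return (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2


# Standardized effect size R = |delta| / sigma = 0.5 (half an SD).
for power in (0.70, 0.80, 0.90):
    n = one_sample_n(sigma=1.0, delta=-0.5, power=power)
    print(f"power = {power:.2f}: n = {ceil(n)}")
```

Rounding up to the next whole observation gives 19, 25, and 35 observations for 70%, 80%, and 90% power, matching the values read from Figure 1.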

*R*. The graph in Figure 2 shows how power decreases for decreasing standardized effect size.
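The relationship between power and standardized effect size can also be computed directly. This is a minimal sketch, assuming the one-sided, lower-tailed test with known SD discussed in this section, for which the power at sample size *n* is Φ(*z*_{α} + *R*√*n*), with Φ the standard normal distribution function; the function name `one_sided_power` and the choice *n* = 25 are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_power(n, effect_size, alpha=0.05):
    """Power of a one-sided one-sample z-test.

    effect_size is the standardized effect size R = |delta| / sigma.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(alpha)  # negative, e.g. about -1.645
    return std_normal.cdf(z_alpha + effect_size * sqrt(n))


# Power for n = 25 observations over a range of standardized effect sizes.
for r in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"R = {r:.1f}: power = {one_sided_power(25, r):.3f}")
```

For *R* = 0 the expression reduces to the significance level α, and power falls toward α as *R* shrinks, which is the pattern described for Figure 2.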

**Facts to Remember.**

- • Sample size increases with power. The more power you want, the larger the sample size.
- • Sample size increases with decreasing detectable difference. The smaller the difference or effect size you expect, the larger the sample size that will be required.
- • Sample size increases proportionally to the variance. The larger the uncertainty of the outcome measurement (variability of a result), the larger the sample size must be. The sample size quadruples with a doubling of the SD.
- • Tests are typically one-sided because one expects an increase (or a decrease) in the mean. Two-sided tests, without preference as to whether the mean increases or decreases, require a larger sample size than one-sided tests. For a two-sided test, the term *z*_{α} in the earlier noted result is replaced by *z*_{α/2}. For α = 0.05, *z*_{α/2} = −1.96.
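Each of these facts can be verified from the sample size formula *n* = (*z*_{α} + *z*_{β})^{2}(σ/δ)^{2}. The sketch below is illustrative only: the function name `one_sample_n` and the `two_sided` flag are hypothetical, and the two-sided case simply swaps *z*_{α} for *z*_{α/2} as described in the last item.

```python
from math import isclose
from statistics import NormalDist


def one_sample_n(sigma, delta, alpha=0.05, power=0.80, two_sided=False):
    """Unrounded one-sample size; two_sided replaces z_alpha by z_{alpha/2}."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(tail)
    z_beta = NormalDist().inv_cdf(1 - power)
    return (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2


base = one_sample_n(sigma=1.0, delta=0.5)

# More power requires a larger sample.
print(one_sample_n(1.0, 0.5, power=0.90) > base)
# Halving the detectable difference quadruples the sample size.
print(isclose(one_sample_n(1.0, 0.25) / base, 4.0))
# Doubling the SD (four times the variance) quadruples the sample size.
print(isclose(one_sample_n(2.0, 0.5) / base, 4.0))
# A two-sided test needs more observations than a one-sided test.
print(one_sample_n(1.0, 0.5, two_sided=True) > base)
```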

**Example:** For the general population, mean thickness of the inner retina is known to be 100 µm, based on prior research publications. The subject variability is large, with an SD of approximately 20 µm. We are interested in whether individuals from a certain ethnic group have a thinner (smaller) mean retinal thickness. How many subjects from this ethnic group need to be studied to confirm with 80% power a reduction in mean thickness of 5 µm? In this case, μ_{0} = 100 and σ = 20. For 5% significance, and 80% power to detect a reduction of δ = −5 units, we need *n* = (*z*_{α} + *z*_{β})^{2}(σ/δ)^{2} = (−1.645 − 0.8416)^{2}(20/−5)^{2} ≈ 100 individuals.

**Example:** A new glaucoma strain of mice has been developed through breeding on a Black 6 background strain (C57BL/6). The investigators are interested in a power analysis to determine how many mice of the new strain are needed to test for a significant increase in intraocular pressure (IOP). From the literature, it is known that C57BL/6 mice at 7 months of age have IOP with mean 13.3 mm Hg and SD 1.25 mm Hg. An increase in IOP in the new strain would be considered meaningful if the mean increased by 0.5 mm Hg, to 13.8 mm Hg. For 5% significance, and 80% power to detect an increase of δ = 0.5 mm Hg, we need \(n = (z_{\alpha} + z_{\beta})^2 (\sigma/\delta)^2 = (-1.645 - 0.8416)^2 (1.25/0.5)^2 \approx 39\) mice.

*D* = *Y*_{2} − *Y*_{1}, where *Y*_{2} is the response under treatment 2 and *Y*_{1} is the response under treatment 1. The two groups may reflect treatment and control, or after-treatment and baseline. An important aspect of the paired comparison is that both treatments are applied to the same subject, allowing us to express the treatment effect as the difference of the two measurements. After taking differences, the problem reduces to a one-sample comparison, and the previous result can be applied. All the researcher needs to provide is a planning value of the SD of the differences, \(\sigma = \sqrt{\mathrm{Var}(D)}\), the mean of the differences under the null hypothesis μ_{0}, and a meaningful detectable difference.

**Example:** Assume an experiment in which eyes of the Black 6 mouse strain (C57BL/6) are treated with a pressure-lowering eye drop. Drops are administered to one randomly selected eye of each mouse. The change in IOP from before to after treatment (*D* = treatment IOP − baseline IOP) reflects the effectiveness of the medication. Fortunately, a number of publications assess the variability of the difference in IOP measurements from the same eye at two different time points, and a planning value for the SD of such a difference can be obtained from the literature. In the case of mice, \(\sigma = \sqrt{\mathrm{Var}(D)} \approx 1\) mm Hg. We wish to test whether the treatment is effective, that is, whether the mean of the treatment-minus-baseline differences is less than 0. A reduction of 0.5 mm Hg is considered clinically significant. With this information, the number of mice should be *n* = (*z*_{α} + *z*_{β})^{2}(σ/δ)^{2} = (−1.645 − 0.8416)^{2}(1/−0.5)^{2} ≈ 25.
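The paired calculation can be sketched in code as follows. The helper `sd_of_difference` is an added illustration, not part of the original text: it uses the standard identity Var(*Y*_{2} − *Y*_{1}) = Var(*Y*_{1}) + Var(*Y*_{2}) − 2Cov(*Y*_{1}, *Y*_{2}) under the assumption that both measurements have the same SD and a within-subject correlation `rho`, and shows why strongly correlated repeated measurements lead to a small SD of the differences. Both function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sd_of_difference(sd, rho):
    """SD of D = Y2 - Y1, assuming equal SDs and within-subject correlation rho."""
    return sqrt(2 * sd ** 2 * (1 - rho))


def paired_n(sd_diff, delta, alpha=0.05, power=0.80):
    """Unrounded number of pairs; a one-sample calculation on the differences."""
    z = NormalDist().inv_cdf(alpha) + NormalDist().inv_cdf(1 - power)
    return z ** 2 * (sd_diff / delta) ** 2


# IOP example: planning SD of the differences of 1 mm Hg, reduction of 0.5 mm Hg.
print(ceil(paired_n(sd_diff=1.0, delta=-0.5)))

# Assumed values for illustration: a higher within-subject correlation
# shrinks the SD of the differences and therefore the required sample size.
for rho in (0.2, 0.5, 0.8):
    print(rho, round(sd_of_difference(sd=1.0, rho=rho), 3))
```

The first line reproduces the 25 mice of the example; the loop uses made-up correlations purely to show the direction of the effect.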

*H*_{0}: μ_{2} − μ_{1} = 0 against the one-sided (lower-tailed) alternative *H*_{1}: μ_{2} − μ_{1} < 0. Both group means, μ_{1} and μ_{2}, are unknown and must be estimated from the sampled data. This makes the problem different from the one-sample situation discussed previously, in which one mean is known with certainty. For a two-sample comparison, we need to specify values for the following five quantities:

- • σ_{1} and σ_{2}: **two SDs** that need not be equal
- • **significance level**; usually α = 0.05
- • **power** (usually 0.80) to detect a **specified meaningful difference** (effect size) δ = μ_{2} − μ_{1} < 0

**Result 1:** (Ledolter^{8}). The required total sample size (for groups 1 and 2 together) is

\(N = (z_{\alpha} + z_{\beta})^2 \left[(\sigma_1 + \sigma_2)/\delta\right]^2\).

The group sample sizes, *n*_{1} and *n*_{2}, must be selected proportional to the SDs: \(n_1 = \frac{\sigma_1}{\sigma_1 + \sigma_2} N\) and \(n_2 = \frac{\sigma_2}{\sigma_1 + \sigma_2} N\).

**Result 2:** When the SDs are the same (σ_{1} = σ_{2} = σ), the sample size for either of the two groups is *n* = 2(*z*_{α} + *z*_{β})^{2}(σ/δ)^{2}, for a combined sample size of *N* = 2*n* = 4(*z*_{α} + *z*_{β})^{2}(σ/δ)^{2}.

**Example:** Ledolter and Kardon^{9} studied the average retinal nerve fiber layer (RNFL) thickness from an optic disc scan for both normal subjects and glaucoma patients on optimal treatment. As expected, the average RNFL thickness of normal subjects was considerably larger than that of glaucoma subjects. They also found that the variability in RNFL thickness among glaucoma patients (SD = 10 µm) was larger than that of normal subjects (SD = 8.5 µm). The larger SD of the glaucoma group is expected because there is a large range of disease severity and response to treatment affecting the thickness of the RNFL.

For group 1 of healthy subjects, σ_{1} = 8.5 µm; for group 2 of glaucoma patients, σ_{2} = 10 µm. For a detectable difference of interest δ = −5 µm, 80% power (β = 0.20), and significance level α = 0.05, the combined sample size from Result 1 is *N* = (*z*_{α} + *z*_{β})^{2}[(σ_{1} + σ_{2})/δ]^{2} = (−1.645 − 0.8416)^{2}[(8.5 + 10)/−5]^{2} ≈ 85. We should sample (8.5/18.5)85 = 39 healthy subjects and (10/18.5)85 = 46 glaucoma patients. We should sample more glaucoma patients because their variability is larger.
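The allocation in this example can be reproduced with a short script. It is a sketch under the assumptions of Result 1 as used in the calculation above, namely *N* = (*z*_{α} + *z*_{β})^{2}[(σ_{1} + σ_{2})/δ]^{2} with group sizes proportional to the SDs; the function name `two_sample_unequal_sd` and the rounding rule (round the total up, then split it) are illustrative choices.

```python
from math import ceil
from statistics import NormalDist


def two_sample_unequal_sd(sigma1, sigma2, delta, alpha=0.05, power=0.80):
    """Return (total N, n1, n2) for a one-sided two-sample comparison."""
    z = NormalDist().inv_cdf(alpha) + NormalDist().inv_cdf(1 - power)
    total = ceil(z ** 2 * ((sigma1 + sigma2) / delta) ** 2)
    # Allocate observations in proportion to the group SDs.
    n1 = round(total * sigma1 / (sigma1 + sigma2))
    return total, n1, total - n1


total, n_healthy, n_glaucoma = two_sample_unequal_sd(8.5, 10.0, delta=-5.0)
print(total, n_healthy, n_glaucoma)
```

With the RNFL planning values this yields 85 subjects in total, split into 39 healthy subjects and 46 glaucoma patients.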

*n* = 2(*z*_{α} + *z*_{β})^{2}(σ/δ)^{2} = 2(−1.645 − 0.8416)^{2}(10/−5)^{2} ≈ 50, for a combined sample size of 100.
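The equal-SD calculation of Result 2 can be sketched in the same way. The function name `two_sample_equal_sd` is hypothetical; the code assumes the per-group formula *n* = 2(*z*_{α} + *z*_{β})^{2}(σ/δ)^{2} and checks that doubling it agrees with the total from Result 1 when σ_{1} = σ_{2}.

```python
from math import ceil, isclose
from statistics import NormalDist


def two_sample_equal_sd(sigma, delta, alpha=0.05, power=0.80):
    """Unrounded per-group sample size when both groups share the same SD."""
    z = NormalDist().inv_cdf(alpha) + NormalDist().inv_cdf(1 - power)
    return 2 * z ** 2 * (sigma / delta) ** 2


n_per_group = two_sample_equal_sd(sigma=10.0, delta=-5.0)
print(ceil(n_per_group), 2 * ceil(n_per_group))

# Consistency check: with sigma1 = sigma2 = sigma, the Result 1 total
# (z_alpha + z_beta)^2 * ((sigma1 + sigma2) / delta)^2 equals 2n.
z = NormalDist().inv_cdf(0.05) + NormalDist().inv_cdf(0.20)
print(isclose(z ** 2 * ((10.0 + 10.0) / -5.0) ** 2, 2 * n_per_group))
```

Using a common SD of 10 µm gives 50 subjects per group (100 in total), more than the 85 obtained when the smaller SD of the healthy group is taken into account.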

**Example:** A second example from visual science considers an experiment that investigates whether a topical medication can reduce the IOP. The experiment compares a group of mice receiving the medication with another group of mice of the same strain receiving a placebo drop. The two groups are matched on similar levels of IOP; effectiveness is measured by the change in IOP from baseline, measured prior to receiving the treatment or placebo. We compare two groups: group 1 consisting of mice receiving the placebo, and group 2 receiving the treatment. The SD of differences in IOP taken on the same subject at different times is 1.16 mm Hg, and there are good reasons to assume that the SDs in the treatment and the placebo groups are about the same. If we want 80% power to detect a mean change in IOP of 0.5 mm Hg, *n* = 2(*z*_{α} + *z*_{β})^{2}(σ/δ)^{2} = 2(−1.645 − 0.8416)^{2}(1.16/0.5)^{2} ≈ 67, for a combined sample size of 134.

*f* = 0.2, or a 30% decrease in the mean when *f* = −0.3. The measurements *Y* for each of the two groups are often log-normally distributed (that is, the logarithm of the measurements is normally distributed), with different means but equal coefficients of variation \(c = \frac{\sqrt{\mathrm{Var}(Y)}}{E(Y)}\).

*c*.

**Result:** (Van Belle and Martin^{10}). The objective is to detect a 100*f* percent proportionate change in the means, and to do so with power 1 − β. For two log-normal distributions with equal coefficients of variation *c*, the number of observations needed in each group is

\(n = 2(z_{\alpha} + z_{\beta})^2 \, \frac{\log(1 + c^2)}{[\log(1 + f)]^2}\).

**Example:** Activation of neurons by sensory stimuli follows a proportional law (referred to as the Weber-Fechner law^{11,12}), and measures of sensitivity to stimuli tend to follow log-normal distributions. We are comparing two groups of mice: a normal group and one with a new, genetically engineered form of retinitis pigmentosa with damage to the rods and cones. The mice in each of the two groups are exposed to a series of stimuli differing in light intensity, and the amplitude of the electroretinogram (ERG) is recorded in response to a flash of light at each intensity. Amplitudes at each intensity follow log-normal distributions with coefficient of variation *c* = 0.30.^{13} We expect that the ERG response in the normal group will be larger than that of the retinitis pigmentosa group. We want 80% power to detect a 20% greater ERG response in the mean of the normal group. For *c* = 0.30, \(\sqrt{\log(1 + (0.30)^2)} = 0.2936\); for *f* = 0.20, log(1 + 0.2) = 0.1823. We need *n* = 2(−1.645 − 0.8416)^{2}(0.2936/0.1823)^{2} ≈ 32 mice in each group, for a total of 64 mice.
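The ERG calculation can be reproduced in a few lines. This sketch assumes the per-group formula implied by the worked numbers above, *n* = 2(*z*_{α} + *z*_{β})^{2} log(1 + *c*^{2}) / [log(1 + *f*)]^{2}, with natural logarithms; the function name `lognormal_n_per_group` is hypothetical.

```python
from math import log
from statistics import NormalDist


def lognormal_n_per_group(c, f, alpha=0.05, power=0.80):
    """Unrounded per-group sample size to detect a proportionate change f
    in the mean of log-normal data with coefficient of variation c."""
    z = NormalDist().inv_cdf(alpha) + NormalDist().inv_cdf(1 - power)
    return 2 * z ** 2 * log(1 + c ** 2) / log(1 + f) ** 2


n = lognormal_n_per_group(c=0.30, f=0.20)
print(round(n, 2))
```

The unrounded value is just above 32, which the example reports as 32 mice per group; a more conservative plan would round up to 33. Because log(1 + *f*) is squared, the same function also handles proportionate decreases such as *f* = −0.3.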

*n* = 2(*z*_{α} + *z*_{β})^{2}(σ/δ)^{2}; see our earlier Result 2. This result assumes that the two treatments are assigned to the experimental units (e.g., subjects or mice) at random. However, sometimes the randomization is carried out on **clusters** that consist of groupings of the experimental units. Clusters may be cages of animals, and experimental units could be mice. Clusters may be patients, and experimental units could be eyes. The randomization is at the cluster level: the treatment groups (experimental and control) are assigned to clusters at random, and each of the *m* experimental units in a cluster is assigned to the same treatment. Although the data of interest come from the experimental units in the two experimental groups, the randomization is carried out on the clusters. In the example with patient eyes, we may assign *n* = 10 patients to each of the two treatments, for a total of 20 patients. For cluster size *m* = 2, this generates a total of 40 eyes, with 20 eyes for each treatment.

*m* observations in a cluster do not carry the same weight as *m* independent observations. For the retinal thickness example in Ledolter et al.,^{14} the intracluster correlation is approximately 0.8.

**Discussion:** The intracluster correlation inflates the sample size that we obtain under complete random sampling, 2(*z*_{α} + *z*_{β})^{2}(σ/δ)^{2}, by the factor [1 + (*m* − 1)ρ]. For ρ = 0, we are back to our earlier Result 2. For ρ = 1, we must multiply the sample size that we obtain under complete random sampling by the number of experimental units in the cluster (*m*). In that case, the experimental units in a cluster are carbon copies of each other: the *m* experimental units in a cluster effectively count as one unit (and not as *m*).
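The inflation factor is easy to tabulate. The following sketch assumes the factor [1 + (*m* − 1)ρ] stated above; the function names and the starting value of 50 independent observations per group are hypothetical and serve only to show the two limiting cases and one intermediate value.

```python
from math import ceil


def design_effect(m, rho):
    """Inflation factor 1 + (m - 1) * rho for clusters of m units."""
    return 1 + (m - 1) * rho


def clustered_units(n_independent, m, rho):
    """Experimental units needed per group when randomizing whole clusters."""
    return ceil(n_independent * design_effect(m, rho))


# Hypothetical planning value: 50 independent observations per group, m = 2 eyes.
for rho in (0.0, 0.8, 1.0):
    units = clustered_units(50, m=2, rho=rho)
    print(f"rho = {rho}: factor = {design_effect(2, rho):.1f}, units per group = {units}")
```

With ρ = 0 nothing changes, with ρ = 1 the number of units doubles for *m* = 2, and with ρ = 0.8 (the approximate value cited for retinal thickness) the requirement rises by a factor of 1.8.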

*n* = *km* = 2(*z*_{α} + *z*_{β})^{2}(σ/δ)^{2}[1 + (*m* − 1)ρ] = 100 eyes, you take 50 subjects and analyze both of their eyes. It would be wrong to ignore the intracluster correlation and calculate the number of eyes from *n* = *km* = 100/[1 + (*m* − 1)ρ] = 56.55, taking only 28 subjects with their 56 eyes.

*t*-test) if the underlying distributions are in fact normal, and being less efficient implies that the sample size must be increased to achieve the same power. For the two-sample comparison, Lehmann^{15} shows that in most situations the sample sizes derived for the parametric test should be increased by approximately 15%. Hence, if one plans to use a nonparametric test, a good rule of thumb is to add approximately 15% to the sample size required for the parametric test.
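As a small illustration of this rule of thumb, the adjustment can be applied to any parametric sample size. The function name `nonparametric_n` and the default inflation of 15% are assumptions taken from the rule stated above, not an exact efficiency calculation.

```python
from math import ceil


def nonparametric_n(parametric_n, inflation=0.15):
    """Inflate a parametric sample size by a fixed fraction and round up."""
    return ceil(parametric_n * (1 + inflation))


# Hypothetical starting point: 50 subjects per group from a parametric calculation.
print(nonparametric_n(50))
```

Starting from 50 subjects per group, the rule suggests planning for 58 per group when a rank-based test will be used.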

^{3}), power analysis software is readily available (see Appendix). The book by Cohen^{16} is another source for formulas, tables, and much useful practical discussion.

**J. Ledolter**, None; **R.H. Kardon**, None

*Statistics for Experimenters: Design, Innovation, and Discovery*. 2nd ed. New York: Wiley & Sons; 2005.

*Testing 1-2-3: Experimental Design with Applications in Marketing and Service Operations*. Stanford, CA: Stanford University Press; 2007.

*Design and Analysis of Experiments*. 8th ed. New York: Wiley & Sons; 2012.

*The Design of Experiments. Edinburgh: Oliver and Boyd, 1935 (various later editions, such as 9th ed)*. New York: Macmillan Publishing Company; 1971.

*Behav Res Methods*. 2009; 41: 1149–1160. [CrossRef] [PubMed]

*J Econ Manag*. 2013; 9: 271–290.

*Trans Vis Sci Tech*. 2018; 7(5): 34, https://doi.org/10.1167/tvst.7.5.34. [CrossRef]

*Am Stat*. 1993; 47: 165–167.

*Trends Cogn Sci*. 2019; 23: 906–908. [CrossRef] [PubMed]

*Nat Neurosci*. 2019; 22: 1493–1502. [CrossRef] [PubMed]

*PLoS Genet*. 2015; 11: e1005723. [CrossRef] [PubMed]

*Invest Ophthalmol Vis Sci*. 2020; 61(6): 25, https://doi.org/10.1167/iovs.61.6.25. [CrossRef] [PubMed]

*Nonparametrics: Statistical Methods Based on Ranks*. Revised 1st ed. Upper Saddle River, NJ: Prentice Hall; 1998: 76–81.

*Statistical Power Analysis for the Behavioral Sciences*. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates; 1988.

**Minitab** (State College, PA, USA); https://www.minitab.com/

**The R Project for Statistical Computing;**https://www.r-project.org/

- • built-in R functions in library(stats)
- • library(pwr), which implements power analysis procedures as outlined in Cohen^{16}
- • several other specialized power analysis packages

**SAS**(Cary, NC, USA); https://www.sas.com/

**Lenth RV:** Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA, USA; http://www.stat.uiowa.edu/~rlenth/Power

*Am Stat*. 2001;55:187–193.