**Purpose**:
To describe and demonstrate appropriate statistical approaches for estimating sensitivity, specificity, predictive values and their 95% confidence intervals (95% CI) for correlated eye data.

**Methods**:
We described generalized estimating equations (GEE) and the cluster bootstrap as approaches to account for inter-eye correlation and applied them to data from a clinical study of telemedicine for the detection of retinopathy of prematurity (ROP).

**Results**:
Among 100 infants (200 eyes) selected for analysis, 20 infants had referral-warranted ROP (RW-ROP) in both eyes and 9 infants had RW-ROP in only one eye, based on clinical eye examination. In the per-eye analysis that included both eyes of an infant, the image evaluation for RW-ROP had a sensitivity of 83.7% and a specificity of 86.8%. The 95% CIs from the naïve approach that ignored the inter-eye correlation were narrower than those from the GEE approach and the cluster bootstrap for both sensitivity (width of 95% CI: 22.4% vs. 23.2% vs. 23.9%) and specificity (11.4% vs. 12.5% vs. 11.6%). The 95% CIs for sensitivity and specificity calculated from left eyes and right eyes separately were wider (35.2% and 30.8%, respectively, for sensitivity; 25.4% and 17.3%, respectively, for specificity).

**Conclusions**:
When an ocular test is performed in both eyes of some or all of the study subjects, the statistical analyses are best performed at the eye level and should account for the inter-eye correlation by using either the GEE or the cluster bootstrap. Ignoring the inter-eye correlation results in 95% CIs that are inappropriately narrow, and analyzing data from the two eyes separately is not efficient.

^{1}Because ocular measures are commonly taken from both eyes of a subject, thereby generating correlated eye data, statistical analyses for evaluating the accuracy of an ocular test need to account for the inter-eye correlation. In this paper, we describe and demonstrate appropriate statistical approaches for estimating sensitivity, specificity, and predictive values and their 95% confidence intervals (CIs). In addition, we consider whether the presence of the condition should be evaluated per subject or per eye.

^{2}

Sensitivity is the test's ability to detect the disease when the disease is present (i.e., Sensitivity (*Se*) = $P(T{+} \mid D{+}) = n_{11}/n_{1}$) or, in words, the probability of a positive test result given that the disease is present. Specificity is the test's ability to exclude the disease when the disease is absent (i.e., Specificity (*Sp*) = $P(T{-} \mid D{-}) = n_{00}/n_{0}$) or, in words, the probability of a negative test result given that the disease is absent.

The 95% CIs for sensitivity and specificity can be calculated using the normal approximation to the binomial distribution^{3}:

$$Se \pm 1.96\sqrt{\frac{Se\,(1 - Se)}{n_{1}}}, \qquad Sp \pm 1.96\sqrt{\frac{Sp\,(1 - Sp)}{n_{0}}}$$

When, for example, $n_{1} \times Se \times (1 - Se)$ or $n_{0} \times Sp \times (1 - Sp)$ is less than 5, the normal approximation may not be accurate.^{4} Other methods,^{4} such as the Clopper–Pearson exact method or the Wilson method, should be used to provide better accuracy and to guarantee that the 95% CIs are within the desired range of 0 to 1.^{5} The Clopper–Pearson interval provides an exact interval because it is directly based on the cumulative probabilities of the binomial distribution rather than on an approximation to the binomial distribution. The Clopper–Pearson interval never has less than the nominal coverage (e.g., 95%), so it is usually conservative.^{5} The Wilson interval is an improvement over the normal approximation interval in that the actual coverage probability is closer to the nominal value. The Wilson method has good properties even for a small number of observations and/or an extreme alpha error level. Clopper–Pearson, Wilson, and other alternative intervals are available in most statistical packages, and further details on their implementation and performance are described elsewhere.^{4,5}
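As a minimal sketch of the three interval methods discussed here, the Python below computes normal-approximation (Wald), Wilson, and Clopper–Pearson limits for a single proportion using only the standard library. The function names are illustrative. The counts 41/49 and 131/151 are an assumption: they are back-calculated from the reported 83.7% sensitivity and 86.8% specificity among 49 eyes with and 151 eyes without RW-ROP, not quoted from the study. The Clopper–Pearson limits are found by bisection on the binomial CDF, whereas statistical packages typically use beta quantiles.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_ci(x, n, z=Z_95):
    """Normal-approximation interval; may fall outside [0, 1]."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_ci(x, n, z=Z_95):
    """Wilson score interval; always stays within [0, 1]."""
    p = x / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_ci(x, n, alpha=0.05):
    """Exact interval obtained by inverting the binomial CDF with bisection."""

    def solve(cdf_at, target):
        # cdf_at(p) is decreasing in p, so move right while it exceeds the target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: P(X >= x | p) = alpha / 2, i.e. P(X <= x - 1 | p) = 1 - alpha / 2.
    lower = 0.0 if x == 0 else solve(lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper limit: P(X <= x | p) = alpha / 2.
    upper = 1.0 if x == n else solve(lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper


# Assumed counts (back-calculated, see the note above): eyes treated as independent.
for label, x, n in [("sensitivity", 41, 49), ("specificity", 131, 151)]:
    print(label, round(x / n, 3))
    for name, method in [("Wald", wald_ci), ("Wilson", wilson_ci),
                         ("Clopper-Pearson", clopper_pearson_ci)]:
        low, high = method(x, n)
        print(f"  {name}: {low:.3f} to {high:.3f}")
```

All three intervals here treat the eyes as independent observations, so they correspond to the naïve analysis rather than to an analysis that accounts for inter-eye correlation.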

The sensitivity (*Se*) and specificity (*Sp*) estimated from a case-control study can be applied to calculate the PPV and NPV of a test in a target population with disease prevalence (*P*), which is usually estimated from a separate study.

[…]^{6}:

where *P* is the prevalence of the disease of interest (assumed known), *Se* and *Sp* are the sensitivity and specificity of the test for detecting the disease of interest, and $n_{1}$ and $n_{0}$ are the numbers of subjects with and without disease, respectively, in the study used for calculating the sensitivity and specificity.

[…],^{6} as described below, can be used to calculate the 95% CIs to guarantee that they fall between 0 and 1.
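As an illustration of how sensitivity, specificity, and an externally supplied prevalence combine into predictive values, the sketch below applies Bayes' theorem and then builds an interval on the logit scale with a delta-method variance, which keeps the limits between 0 and 1. The logit interval is one possible choice and is an assumption here, not necessarily the formula used in the paper. The function names, the prevalence of 0.10, and the counts 41/49 and 131/151 (back-calculated from the reported 83.7% and 86.8%) are likewise illustrative assumptions.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def predictive_values(se, sp, prev):
    """PPV and NPV from Bayes' theorem for a population with prevalence prev."""
    ppv = se * prev / (se * prev + (1 - sp) * (1 - prev))
    npv = sp * (1 - prev) / ((1 - se) * prev + sp * (1 - prev))
    return ppv, npv


def logit_ci_predictive_values(se, sp, n1, n0, prev, z=Z_95):
    """Logit-scale intervals for PPV and NPV, treating prevalence as known.

    Requires 0 < se < 1 and 0 < sp < 1. n1 and n0 are the numbers of
    diseased and non-diseased units used to estimate se and sp.
    """

    def expit(v):
        return 1 / (1 + math.exp(-v))

    # logit(PPV) = log(se * prev) - log((1 - sp) * (1 - prev))
    logit_ppv = math.log(se * prev / ((1 - sp) * (1 - prev)))
    var_ppv = (1 - se) / (se * n1) + sp / ((1 - sp) * n0)
    # logit(NPV) = log(sp * (1 - prev)) - log((1 - se) * prev)
    logit_npv = math.log(sp * (1 - prev) / ((1 - se) * prev))
    var_npv = se / ((1 - se) * n1) + (1 - sp) / (sp * n0)

    ppv_ci = (expit(logit_ppv - z * math.sqrt(var_ppv)),
              expit(logit_ppv + z * math.sqrt(var_ppv)))
    npv_ci = (expit(logit_npv - z * math.sqrt(var_npv)),
              expit(logit_npv + z * math.sqrt(var_npv)))
    return ppv_ci, npv_ci


# Hypothetical inputs: prevalence of 0.10 and back-calculated counts.
se_hat, sp_hat = 41 / 49, 131 / 151
ppv, npv = predictive_values(se_hat, sp_hat, prev=0.10)
ppv_ci, npv_ci = logit_ci_predictive_values(se_hat, sp_hat, n1=49, n0=151, prev=0.10)
print(f"PPV {ppv:.3f} ({ppv_ci[0]:.3f} to {ppv_ci[1]:.3f})")
print(f"NPV {npv:.3f} ({npv_ci[0]:.3f} to {npv_ci[1]:.3f})")
```

The variance terms assume independent observations; with correlated eye data they would understate the uncertainty, which is one reason a cluster-level resampling approach may be preferable.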

^{7}In applying the GEE approach to estimating sensitivity and specificity, the ocular test result for each eye (T+ or T−) is modeled as the outcome variable, the true eye disease status (D+ or D−) from the reference standard procedure is the predictor, and the logit link is used. By convention, a positive test result is assigned a value of 1 and a negative test result a value of 0, and likewise for disease presence and absence. One way to use the GEE approach is to specify an “independent” working correlation structure in the statistical software and rely on the approach's robust (sandwich) estimator to provide accurate variance estimates for the calculation of 95% CIs. This specification is often the default option for procedures using GEE. Although it appears to be an incorrect choice for correlated data, it works well for the case of modeling a 2 × 2 table. More detailed descriptions of the GEE method for accounting for inter-eye correlation in analyzing categorical ocular measures may be found elsewhere.^{8}

The SAS code for the calculation of the 95% CIs of sensitivity and specificity using GEE is given in Appendix 2. Of note, in fitting GEE using PROC GENMOD in SAS, the DESCENDING option was specified so that the model estimates the probability of the outcome coded 1 (a positive test result). In R, GEE modeling can be performed by using the function geeglm() of the “geepack” package or the function gee() of the “gee” package. When running these GEE functions in R, it is important to first sort the data by subject ID so that data from the two eyes of the same subject are adjacent to each other; otherwise, the data from the two eyes of a subject will be analyzed as independent. In SAS, sorting the data by subject ID is not needed for GEE.
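To make the role of the robust estimator concrete, the sketch below works through the special case the text describes: a logit model with disease status as the only predictor and an independence working correlation. In that case the point estimate is the ordinary proportion within each disease group, and the robust variance is built from per-subject sums of residuals. This is a hand-rolled illustration in Python, not the Appendix 2 SAS code; the function name and the 10-subject data set are hypothetical, and the interval is formed on the logit scale and back-transformed, which is an assumption about how the CI is constructed.

```python
import math
from collections import defaultdict

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def robust_proportion_ci(eyes, disease_status, z=Z_95):
    """Sensitivity (disease_status=1) or specificity (disease_status=0)
    with naive and cluster-robust logit-scale intervals.

    eyes: iterable of (subject_id, disease, test), each coded 0/1.
    Requires the estimated proportion to lie strictly between 0 and 1.
    """
    # Keep eyes with the requested true status; y = 1 when the test agrees with it.
    by_subject = defaultdict(list)
    for subject_id, disease, test in eyes:
        if disease == disease_status:
            by_subject[subject_id].append(1 if test == disease else 0)

    n_eyes = sum(len(ys) for ys in by_subject.values())
    p = sum(sum(ys) for ys in by_subject.values()) / n_eyes

    # Sandwich pieces for the logit parameter under working independence.
    bread = n_eyes * p * (1 - p)
    meat = sum(sum(y - p for y in ys) ** 2 for ys in by_subject.values())
    se_robust = math.sqrt(meat) / bread  # accounts for within-subject correlation
    se_naive = 1 / math.sqrt(bread)      # treats every eye as independent

    logit = math.log(p / (1 - p))

    def expit(v):
        return 1 / (1 + math.exp(-v))

    return {
        "estimate": p,
        "robust_ci": (expit(logit - z * se_robust), expit(logit + z * se_robust)),
        "naive_ci": (expit(logit - z * se_naive), expit(logit + z * se_naive)),
    }


# Hypothetical data: 10 subjects, both eyes diseased, mostly concordant test results.
eyes = []
for sid in range(7):
    eyes += [(sid, 1, 1), (sid, 1, 1)]   # both eyes test positive
for sid in range(7, 9):
    eyes += [(sid, 1, 0), (sid, 1, 0)]   # both eyes test negative
eyes += [(9, 1, 1), (9, 1, 0)]           # discordant pair

result = robust_proportion_ci(eyes, disease_status=1)
print(result)
```

Because the two eyes of a subject mostly agree in this made-up data set, the robust interval comes out wider than the naive one, which mirrors the pattern reported for the study data.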

^{9}Bootstrapping is a resampling technique in which a statistic of interest (e.g., sensitivity, specificity, or predictive values) is computed repeatedly from a large number of random samples drawn from the original sample, so that the variability of the statistic can be determined. The bootstrap provides a way to draw probability-based, assumption-free inference for a statistic of interest.^{10} Operationally, bootstrapping involves repeatedly taking a random sample of size *n* with replacement from an original sample of size *n* and computing a statistic of interest θ (e.g., sensitivity, specificity, or predictive values). Because the sampling is done with replacement, some observations may appear more than once and other observations may not be selected. The process of drawing a new sample and computing the statistic of interest is performed *B* times (e.g., 1000 times) to generate *B* estimates of θ. From this large number of estimates, the median is taken as the estimate of θ, and the nonparametric CI (e.g., 95% CI) is given by the 2.5th and 97.5th percentiles of the ordered θ estimates.
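The resampling steps just described can be sketched in a few lines of Python. This is the standard (non-clustered) bootstrap, so it treats every observation as independent. The function name, the fixed seed, and the example data (41 positive results among 49 diseased eyes, back-calculated from the reported 83.7% sensitivity) are assumptions made for illustration.

```python
import random
import statistics


def percentile_bootstrap(sample, statistic, B=1000, seed=0):
    """Return (median, 2.5th percentile, 97.5th percentile) of B bootstrap estimates."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = []
    for _ in range(B):
        # Draw n observations with replacement from the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    # Splitting into 40 equal groups puts the first and last cut points
    # at the 2.5th and 97.5th percentiles.
    cuts = statistics.quantiles(estimates, n=40, method="inclusive")
    return statistics.median(estimates), cuts[0], cuts[-1]


# Assumed example: test results (1 = positive) for 49 diseased eyes.
test_positive = [1] * 41 + [0] * 8
estimate, lower, upper = percentile_bootstrap(test_positive, statistics.mean)
print(f"sensitivity {estimate:.3f} ({lower:.3f} to {upper:.3f})")
```

Passing the statistic as a function keeps the resampling loop unchanged when specificity or a predictive value is wanted instead.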

^{11}For each subject selected by sampling with replacement, all eligible eyes of that subject are included in the bootstrapped sample. The desired statistic is computed using the bootstrapped sample, and the process is repeated *B* times. The nonparametric CIs can be derived in the same way as in the standard bootstrapping procedure. The SAS code for the cluster bootstrap for sensitivity and specificity is given in Appendix 3.

^{12}The study enrolled 1257 premature infants and each infant underwent a regularly scheduled diagnostic examination by an ophthalmologist and digital imaging by a nonphysician imager. Ophthalmologists documented findings consistent with RW-ROP (defined as presence of either zone I ROP, ROP stage 3 or higher, or plus disease). Masked nonphysician readers graded a standard 6-image set per eye for ROP stage, zone, and presence of plus disease. The validity of the telemedicine system was evaluated using sensitivity and specificity by comparing the image evaluation (ocular test) findings to the ophthalmologist clinical examination findings (reference standard).

^{12}For the per-eye analysis, the inter-eye correlation was accommodated by using both the GEE and the cluster bootstrap approaches. In the cluster bootstrap, because each infant contributed both eyes to the study, infants were divided into 3 strata: one stratum for the 71 infants without RW-ROP in either eye, a second stratum for the 9 infants with RW-ROP in only 1 eye, and a third stratum for the 20 infants with RW-ROP in both eyes. If some infants had contributed only one eye to the study, two additional strata would be formed (e.g., one stratum for infants without RW-ROP in the study eye and another stratum for infants with RW-ROP in the study eye). The SAS code for these analyses can be found in Appendix 2 for the GEE approach and in Appendix 3 for the cluster bootstrap approach.
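A stratified cluster bootstrap along these lines might look like the Python sketch below, in which infants (not eyes) are resampled with replacement within each stratum and both eyes of a selected infant are kept together. It is not the Appendix 3 SAS code. The stratum sizes (71, 9, and 20 infants) come from the text, but the per-eye test results are hypothetical: they were chosen only so that the point estimates equal 41/49 and 131/151, matching the reported 83.7% and 86.8%. The function names and the seed are also assumptions.

```python
import random
import statistics


def repeat(count, eyes):
    """Build `count` infants, each a list of (disease, test) eye pairs."""
    return [list(eyes) for _ in range(count)]


# Hypothetical eye-level results arranged to fit the three strata in the text.
infants = (
    # 20 infants with RW-ROP in both eyes
    repeat(15, [(1, 1), (1, 1)]) + repeat(2, [(1, 0), (1, 0)]) + repeat(3, [(1, 1), (1, 0)])
    # 9 infants with RW-ROP in one eye only
    + repeat(3, [(1, 1), (0, 1)]) + repeat(5, [(1, 1), (0, 0)]) + repeat(1, [(1, 0), (0, 0)])
    # 71 infants without RW-ROP in either eye
    + repeat(6, [(0, 1), (0, 1)]) + repeat(5, [(0, 1), (0, 0)]) + repeat(60, [(0, 0), (0, 0)])
)


def se_sp(sampled_infants):
    """Per-eye sensitivity and specificity, pooling all eyes of the sampled infants."""
    eyes = [eye for infant in sampled_infants for eye in infant]
    n1 = sum(1 for d, _ in eyes if d == 1)
    n0 = sum(1 for d, _ in eyes if d == 0)
    tp = sum(1 for d, t in eyes if d == 1 and t == 1)
    tn = sum(1 for d, t in eyes if d == 0 and t == 0)
    return tp / n1, tn / n0


def stratified_cluster_bootstrap(all_infants, B=1000, seed=0):
    rng = random.Random(seed)
    # Stratum key: (eyes contributed, diseased eyes). One-eye infants would
    # automatically form their own additional strata.
    strata = {}
    for infant in all_infants:
        key = (len(infant), sum(d for d, _ in infant))
        strata.setdefault(key, []).append(infant)

    se_reps, sp_reps = [], []
    for _ in range(B):
        resample = []
        for members in strata.values():
            # Resample whole infants with replacement within the stratum.
            resample.extend(rng.choices(members, k=len(members)))
        se, sp = se_sp(resample)
        se_reps.append(se)
        sp_reps.append(sp)

    def summarize(reps):
        cuts = statistics.quantiles(reps, n=40, method="inclusive")
        return statistics.median(reps), cuts[0], cuts[-1]

    return summarize(se_reps), summarize(sp_reps)


se_summary, sp_summary = stratified_cluster_bootstrap(infants)
print("sensitivity (median, 2.5th, 97.5th):", se_summary)
print("specificity (median, 2.5th, 97.5th):", sp_summary)
```

Stratifying by the number of diseased eyes keeps the numbers of diseased and non-diseased eyes fixed in every replicate, so neither denominator can shrink to zero.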

^{13}

**G.-S. Ying**, None; **M.G. Maguire**, None; **R.J. Glynn**, None; **B. Rosner**, None

1. *Statistical Methods in Diagnostic Medicine*. New York, NY: John Wiley & Sons Inc.; 2002.
2. *BMJ*. 1994; 308: 1552.
3. *Statistical Methods for Rates and Proportions*. 2nd ed. New York, NY: Wiley-Inter-Science; 1981.
4. *Statistical Science*. 2001; 16: 101–117.
5. *Stat Med*. 1998; 17: 2635–2650.
6. *Stat Med*. 2007; 26: 2170–2183.
7. *Biometrika*. 1986; 73: 13–22.
8. *Ophthalmic Epidemiol*. 2018; 25: 1–12.
9. *J R Stat Soc Series B Stat Methodol*. 2007; 69: 369–390.
10. *Am Statistician*. 1983; 37: 36–48.
11. *Bootstrap method and their applications*. Cambridge: Cambridge University Press; 2000.
12. *JAMA Ophthalmol*. 2014; 132: 1178–1184.
13. *Investig Ophthalmol Vis Sci*. 2020; 61: 27.