Letters to the Editor  |   June 2012
Author Response: Retinal Nerve Fiber Layer Thickness Measurements in Rats with the Spectral Domain Optical Coherence Tomography
Author Affiliations & Notes
  • Katharina Lange
Department of Medical Statistics, Georg-August University, Göttingen, Germany.
  • Mathias Bähr
    Department of Neurology, Georg-August University, Göttingen, Germany.
  • Katharina Hein
    Department of Neurology, Georg-August University, Göttingen, Germany.
Investigative Ophthalmology & Visual Science, June 2012, Vol. 53(7), 3570–3571. https://doi.org/10.1167/iovs.12-10024
Introduction
We would like to thank Huang and colleagues1 for their interesting comments, and in the following we attempt to address all the important issues.
We agree with Huang and colleagues1 that "inter/intraobserver reproducibility" is the more common term. However, in our study the term "reader" was used in accordance with the EMA guideline on the clinical evaluation of diagnostic agents2 and the corresponding FDA guidance.3 There, both the terms "inter/intrareader variability" and "inter/intraobserver reproducibility" are used, because the inter/intrareader variability is determined in order to estimate the inter/intraobserver reproducibility. In our work, the influence of the reader is assessed by a factorial design (e.g., as proposed previously4,5), in which the reader is regarded as a fixed factor; therefore, the terms reader and observer can be used interchangeably.
In the literature, ANOVA is proposed as a standard method for analyzing multiple-reader diagnostic trials.4,5 In these approaches, the reader is regarded as a factor in a multifactorial design, and the consistency between the readers is assessed by testing whether the diagnostic accuracy (mostly measured by sensitivity, specificity,6 or the area under the ROC curve [AUC]4,5) depends on the reader. We are aware that the design of our study differs from that of a usual diagnostic trial; nevertheless, the two-factorial ANOVA (with the reader as a factor) seems a good adaptation of the approaches proposed in the literature. Because our gold standard is not binary, the quality of the diagnostic agent (the spectral domain optical coherence tomography [OCT] measurements) cannot be determined by the usual measures of diagnostic accuracy (such as sensitivity, specificity, or the AUC). Furthermore, a comparison between the OCT measurement and the histomorphometric measurement (which might be regarded as our gold standard) could only be made after the animal was killed, and hence the consistency between those two measurements can be determined at only one time point (see Table 1; the grey-shaded data cannot be used for this comparison).

Moreover, this approach has several advantages compared with the parameters that are usually determined. First, by using a factorial design, the influence of the reader is not only measured (as is done by the coefficient of variation, the intraclass correlation coefficient [ICC], and test-retest variability) but also statistically tested. In addition, this approach allows us to discriminate between two different effects: the main effect (reader) and the interaction (reader × time point). Whereas the main effect addresses the question of whether the readers work on the same absolute level, the interaction term answers an even more important question: do the readers observe the same progression of disease? Second, with the analysis of variance, we can accurately fit a model to our data, which is of great value considering the complex data structure (see Table 1). For example, if the ICC is determined, several statistical issues have to be discussed:
Table 1. Design of the Study

Animal | Baseline      | d7pi          | EAEd1         | EAEd8         | EAEd14
       | Left   Right  | Left   Right  | Left   Right  | Left   Right  | Left   Right
       | 1 2 H  1 2 H  | 1 2 H  1 2 H  | 1 2 H  1 2 H  | 1 2 H  1 2 H  | 1 2 H  1 2 H
 1     | x x -  x x -  | x x -  x x -  | x x -  x x -  | x x -  x x -  | x x x  x x x
 6     | x x -  x x -  | x x -  x x -  | x x -  x x -  | x x -  x x -  | x x x  x x x
 7     | x x -  x x -  | x x -  x x -  | x x -  x x -  | x x x  x x x  |
11     | x x -  x x -  | x x -  x x -  | x x -  x x -  | x x x  x x x  |
12     | x x -  x x -  | x x -  x x -  | x x x  x x x  |               |
16     | x x -  x x -  | x x -  x x -  | x x x  x x x  |               |
17     | x x -  x x -  | x x x  x x x  |               |               |
20     | x x -  x x -  | x x x  x x x  |               |               |
1, 2: OCT measurements evaluated by readers 1 and 2; H: histomorphometric measurement; x: measurement obtained; -: not obtained; blank: no measurement at that time point.
  1.  
    The ICC compares the intrasubject variation with the intersubject variation in order to assess the influence of the within-subject variance (which can be traced back to any inconsistency between the readers). In our case, several questions have to be answered before this value can be calculated: Should the total variance be taken as the reference value? This would answer the question of whether the within-subject variation can be neglected compared with the time effect (i.e., the progression of disease). Or should the ICC be calculated for each time point separately? This would answer the question of whether the influence of the reader can be neglected compared with the usual variation between two subjects having the same state of disease.
  2.  
    As most animals were observed several times, the variance estimators have to be adapted to account for possible correlation between these repeated measurements. Furthermore, the correlation between the two eyes of the same animal has to be considered. If these correlations are ignored, the variance estimator might be biased.
Hence, if the ICC is calculated, this has to be done by means of a linear model that is comparable to the ANOVA model we have already fitted.7 Table 2 shows the results of this analysis for each time point. They can be summarized by an ICC of 0.797 (consistency version); for the difference between the consistency and the agreement versions of the ICC, see, for example, Shrout and Fleiss.8 In order to compare the within-subject variation with the total variation (including the variation induced by the progression of disease), we furthermore calculated the global ICC (ICCglobal = 0.907, consistency version). According to Cettomai et al.,9 "an ICC of less than 0.40 indicates poor reproducibility; of 0.40 to 0.75, fair to good reproducibility; and of greater than 0.75, excellent reproducibility." As the calculation of the ICC and the ANOVA performed in our article are based on comparable linear models, these results underline the conclusions drawn from the analysis of variance presented in the article: the progression of disease can be detected consistently, whereas we achieve no absolute agreement between the two readers. (A computational sketch of this type of analysis is given after Table 2.)
Table 2. Intraclass Correlation Coefficients for the Interobserver Reproducibility

Time Point   ICC (Consistency)   ICC (Agreement)
Baseline     0.792               0.778
d7pi         0.801               0.529
EAEd1        0.780               0.518
EAEd8        0.813               0.567
EAEd14       0.795               0.701
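For readers who wish to retrace this kind of analysis, the following Python sketch illustrates the general approach on simulated data shaped like Table 1. It is not the code used in our study; the package choices (statsmodels for the two-factorial mixed model, pingouin for the ICC) and all numerical values (reader offset, progression rate, noise level) are illustrative assumptions only.

```python
# Illustrative sketch only: two-factorial model with "reader" as a fixed factor and a
# reader x time point interaction, plus per-time-point consistency/agreement ICCs,
# computed on simulated data shaped like Table 1.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import pingouin as pg

rng = np.random.default_rng(0)

# Simulated RNFL thickness (micrometers): 8 animals x 2 eyes x 5 time points x 2 readers.
timepoints = ["Baseline", "d7pi", "EAEd1", "EAEd8", "EAEd14"]
rows = []
for animal in range(1, 9):
    base = rng.normal(45.0, 3.0)                 # between-animal variability (assumed)
    for t_idx, tp in enumerate(timepoints):
        for eye in ["left", "right"]:
            true_val = base - 2.0 * t_idx        # assumed disease progression over time
            for reader in ["R1", "R2"]:
                bias = 1.5 if reader == "R2" else 0.0   # assumed systematic reader offset
                rows.append({"animal": animal, "eye": eye, "unit": f"{animal}-{eye}",
                             "timepoint": tp, "reader": reader,
                             "thickness": true_val + bias + rng.normal(0.0, 0.7)})
df = pd.DataFrame(rows)

# Fixed effects: reader, time point, and their interaction; random intercept per animal
# to account for repeated measurements. The correlation between the two eyes of one
# animal is only partially captured by this simplified random-intercept structure.
mixed = smf.mixedlm("thickness ~ C(reader) * C(timepoint)",
                    data=df, groups="animal").fit()
print(mixed.summary())

# Consistency (ICC3) and agreement (ICC2) per time point, in the spirit of Table 2;
# each animal-eye combination is treated as one rating target.
for tp in timepoints:
    sub = df[df["timepoint"] == tp]
    icc = pg.intraclass_corr(data=sub, targets="unit", raters="reader",
                             ratings="thickness").set_index("Type")
    print(tp, icc.loc[["ICC2", "ICC3"], "ICC"].round(3).to_dict())
```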
Considering the intraobserver reproducibility, we calculated the within-subject variation (i.e., the variability between the different reads of the same observer). As mentioned in the article, this standard deviation is estimated to be 0.71 μm. Unlike the reader, the repeated reads would have to be modeled as a random factor, not as a fixed one, in the respective linear model. Therefore, we decided not to calculate the F-statistic, but to calculate part of the variance components found in the usual F-statistics. Moreover, we determined the total standard deviation, for which all reads of the same observer (over all time points) were considered (5.53 μm). This descriptive analysis should give the general reader of the article an impression of the variability between different reads compared with the variability traced back to the progression of disease. As we achieved no absolute agreement between the two readers but could only consistently detect the progression of disease, we decided that it would be sufficient to take the global variance as the reference value for the intraobserver reproducibility as well. The global ICC (consistency version) can be calculated from the data provided in the article: ICCglobal = (5.53² − 0.71²) / 5.53² ≈ 0.98 (a short numerical sketch of this calculation is given after Table 3). As already mentioned in the paper, we therefore draw the conclusion that the intrareader variability can easily be neglected compared with the variability that is traced back to the progression of disease.

Nonetheless, we are pleased to provide more information on the intraobserver reproducibility. In accordance with the additional analysis provided for the interobserver reproducibility, we calculated the ICC for the intraobserver reproducibility as well (see Table 3). Table 3 shows that we even achieve absolute agreement for each time point. Therefore, the variability between two readings of the same reader can be neglected compared with the usual variability between different subjects with the same state of disease.
Table 3. Intraclass Correlation Coefficients for the Intraobserver Reproducibility

Time Point   ICC (Consistency)   ICC (Agreement)
Baseline     0.953               0.909
d7pi         0.980               0.954
EAEd1        0.960               0.942
EAEd8        0.936               0.908
EAEd14       0.953               0.953
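As a minimal numerical sketch, the global consistency ICC quoted above can be recomputed directly from the two standard deviations reported in the text, assuming it is defined as one minus the ratio of the within-read variance to the total variance:

```python
# Global consistency ICC for the intraobserver reproducibility (sketch).
# Assumption: consistency ICC = 1 - (within-read variance / total variance).
sd_within = 0.71   # SD between repeated reads of the same observer (micrometers)
sd_total = 5.53    # total SD over all reads and all time points (micrometers)

icc_global = (sd_total**2 - sd_within**2) / sd_total**2
print(f"global consistency ICC (intraobserver): {icc_global:.3f}")   # approx. 0.98
```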
To compare the OCT with the histomorphometric measurements, we also created a Bland-Altman plot, from which we drew the same conclusions as Huang and colleagues,1 despite using a different statistical method. Hence, for the sake of clarity for the general reader and because of the limited space of the journal, we decided to include only Pearson's correlation in our manuscript. Note that in our study the histomorphometric analysis showed a smaller retinal nerve fiber layer thickness in all eyes compared with the OCT measurements. As already mentioned in the manuscript, this observation can be explained by the tissue shrinkage that occurs during the fixation procedure. Therefore, the comparison between the OCT measurement and the histomorphometric analysis can only lead to consistency, not to total agreement. For the sake of completeness, we also determined the consistency version of the ICC for these data. Here, an ICC of 0.726 indicates good agreement, considering the extra variance traced back to tissue shrinkage.
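For readers less familiar with the method, a minimal Bland-Altman sketch of such an OCT versus histomorphometry comparison could look as follows; the values are simulated (16 eyes, with a constant offset mimicking the tissue shrinkage discussed above) and are not our measured data.

```python
# Bland-Altman comparison of paired OCT and histomorphometric RNFL measurements.
# All values are simulated; the constant offset mimics tissue shrinkage during fixation.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
oct_um = rng.normal(40.0, 5.0, size=16)                    # OCT thickness (micrometers)
histo_um = oct_um - 6.0 + rng.normal(0.0, 1.5, size=16)    # systematically thinner

means = (oct_um + histo_um) / 2.0
diffs = oct_um - histo_um
bias = diffs.mean()
loa = 1.96 * diffs.std(ddof=1)                             # 95% limits of agreement

plt.scatter(means, diffs)
plt.axhline(bias, color="black")
plt.axhline(bias + loa, color="black", linestyle="--")
plt.axhline(bias - loa, color="black", linestyle="--")
plt.xlabel("Mean of OCT and histomorphometry (micrometers)")
plt.ylabel("OCT minus histomorphometry (micrometers)")
plt.title("Bland-Altman plot (simulated data)")
plt.show()
```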
References
1. Huang J, Savini G, Feng Y, Wang Q. Retinal nerve fiber layer thickness measurements in rats with spectral domain-optical coherence tomography. Invest Ophthalmol Vis Sci. 2012;53:749–750.
2. European Medicines Agency. Guideline on Clinical Evaluation of Diagnostic Agents. Doc. Ref. CPMP/EWP/1119/98/Rev 1. London: European Medicines Agency; 2009.
3. U.S. Food and Drug Administration (FDA). Guidance for Industry: Developing Medical Imaging Drug and Biological Products—Part 3: Design, Analysis, and Interpretation of Clinical Studies. Rockville, MD: FDA; 2004.
4. Obuchowski NA. New methodological tools for multiple-reader ROC studies. Radiology. 2007;243:10–12.
5. Obuchowski NA, Beiden SV, Berbaum KS, et al. Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Acad Radiol. 2004;11:980–995.
6. Lange K, Brunner E. Sensitivity, specificity and ROC-curves in multiple reader diagnostic trials—a unified, nonparametric approach. Statistical Methodology. 2012;9:490–500.
7. Dickey DA. PROC MIXED: Underlying Ideas with Examples. Paper 374-2008. Presented at: SAS Global Forum; March 16–19, 2008; San Antonio, TX.
8. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420–428.
9. Cettomai D, Pulicken M, Gordon-Lipkin E, et al. Reproducibility of optical coherence tomography in multiple sclerosis. Arch Neurol. 2008;65:1218–1222.