We would like to thank Huang and colleagues
1 for their interesting comments. In the following, we attempt to address all the important issues.
We agree with Huang and colleagues
1 that the description “inter/intraobserver reproducibility” is the more common term. However, in our study, the term “reader” was used in accordance with the EMA guideline on clinical evaluation of diagnostic agents
2 and the corresponding FDA guidance.
3 Here the terms “inter/intrareader variability” and “inter/intraobserver reproducibility” are used because the inter/intrareader variability is determined to estimate the inter/intraobserver reproducibility. In our work, the influence of the reader is assessed by a factorial design (e.g., as proposed previously
4,5), whereby the reader is regarded as a fixed factor; therefore, the terms reader and observer can be used interchangeably.
In the literature, analysis of variance (ANOVA) is proposed as a standard method for analyzing multiple-reader diagnostic trials.
4,5 In these approaches, the reader is regarded as a factor in a multifactorial design and the consistency between the readers is assessed by testing whether the diagnostic accuracy (mostly measured by sensitivity, specificity,
6 or the area under the ROC-curve [AUC]
4,5) depends on the reader. We are aware that the design of our study differs from that of a usual diagnostic trial; nevertheless, the two-factorial ANOVA (with the reader as a factor) seems a good adaptation of the approaches proposed in the literature. Because the gold standard is non-binary, the quality of the diagnostic agent (the spectral domain optical coherence tomography [OCT] measurements) cannot be determined by the usual assessments of diagnostic accuracy (such as sensitivity, specificity, or the AUC). Furthermore, a comparison between the OCT measurement and the histomorphometric measurement (which might be regarded as our gold standard) could only be made after the animal was killed; hence, the consistency between those two measurements can only be determined at one time point (see Table 1; the grey data cannot be used for this comparison). In addition, this approach has several advantages compared with the usually determined parameters. First, by using a factorial design, the influence of the reader is not only measured (as done by the coefficient of variation, the ICC, and test-retest variability), but also statistically tested.
Furthermore, this approach allows us to discriminate between two different effects: the main effect (reader) and the interaction (reader × time point). Whereas the main effect addresses the question of whether the readers work at the same absolute level, the interaction term can be used to answer an even more important question: Do the readers observe the same progression of disease? Second, with the analysis of variance, we can accurately fit a model to our data. This is of great value, considering the complex data structure (see
Table 1). For example, if the intraclass correlation coefficient (ICC) is determined, several statistical issues have to be discussed:
Hence, if the ICC is calculated, this has to be done by means of a linear model comparable to the ANOVA model we have already fitted.
7 In
Table 2, the results of the analysis for each time point can be found. The results can be summarized by an ICC of 0.797 (consistency version; for the differences between the consistency and the agreement versions of the ICC, see, for example,
8). To compare the within-subject variation with the total variation (including the variation induced by the progression of disease), we furthermore calculated the global ICC (ICC_global = 0.907, consistency version). In accordance with
9 “an ICC of less than 0.40 indicates poor reproducibility; of 0.40 to 0.75, fair to good reproducibility; and of greater than 0.75, excellent reproducibility.” As the calculation of the ICC and the ANOVA performed in our article are based on comparable linear models, these results underline the conclusions drawn from the analysis of variance presented in our article: The progression of disease can be detected consistently, whereas we achieve no absolute agreement between the two readers.
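The distinction between the consistency and the agreement versions of the ICC can be illustrated with a minimal sketch. It assumes a complete subjects × readers table of single measurements and uses the common two-way mean-square formulation. The function names, the toy values, and the `classify_icc` helper (which merely encodes the thresholds quoted above) are illustrative assumptions and not the linear model fitted in our article.

```python
def _mean(values):
    return sum(values) / len(values)


def icc_two_way(table):
    """Return (consistency ICC, agreement ICC) for single measurements.

    `table` holds one row per subject and one column per reader.
    """
    n = len(table)       # number of subjects
    k = len(table[0])    # number of readers
    grand = _mean([x for row in table for x in row])
    row_means = [_mean(row) for row in table]
    col_means = [_mean([row[j] for row in table]) for j in range(k)]

    # Sums of squares of the two-way layout without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Consistency ignores a systematic offset between readers;
    # agreement penalises it through the reader mean square.
    consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    return consistency, agreement


def classify_icc(icc):
    """Encode the reproducibility thresholds quoted in the text."""
    if icc < 0.40:
        return "poor"
    if icc <= 0.75:
        return "fair to good"
    return "excellent"


# Hypothetical toy data: reader 2 always measures 2 units more than reader 1.
toy = [[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]]
consistency, agreement = icc_two_way(toy)
print(consistency, agreement)
print(classify_icc(0.797), classify_icc(0.907))
```

In this toy table the two readers rank and space the subjects identically, so the consistency version equals 1, whereas the constant offset between the readers lowers the agreement version to one third. This mirrors the situation described above: consistent detection of progression without absolute agreement.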
Regarding the intraobserver reproducibility, we calculated the within-subject variation (i.e., the variability between the different reads of the same observer). As mentioned in the article, this standard deviation is estimated to be 0.71 μm. Unlike the reader, the repeated measures would have to be modeled as a random rather than a fixed factor in the respective linear model. Therefore, we decided not to calculate the F-statistic, but to report part of the variance components that enter the usual F-statistics. Moreover, we determined the total standard deviation, for which all reads of the same observer (at all time points) were considered (5.53 μm). This descriptive analysis should give the general reader of the article an impression of the variability between different reads compared with the variability attributable to the progression of disease. As we achieved no absolute agreement between the two readers but could only consistently detect the progression of disease, we decided that it would be sufficient to take the global variance as the reference value for the intraobserver reproducibility as well. The global ICC (consistency version) can be calculated from the data provided in the article:
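As a rough check of this calculation, the following sketch assumes that the global ICC is the share of the total variance that is not attributable to the within-reader variation, that is, 1 − SD_within² / SD_total². This definition and the function name are assumptions made for illustration; only the two standard deviations are taken from the text above.

```python
def global_icc(sd_within, sd_total):
    """Share of total variance not due to within-reader variation.

    Assumed definition for illustration: 1 - sd_within^2 / sd_total^2.
    """
    return 1.0 - (sd_within ** 2) / (sd_total ** 2)


# Standard deviations quoted in the text, in micrometres.
icc = global_icc(sd_within=0.71, sd_total=5.53)
print(round(icc, 3))
```

Under this assumed definition the value comes out close to 0.98; a model-based estimate may differ slightly.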
As already mentioned in the paper, we conclude that the intrareader variability is negligible compared with the variability attributable to the progression of disease. Nonetheless, we are pleased to provide more information regarding the intraobserver reproducibility. In line with the additional analysis provided for the interobserver reproducibility, we calculated the ICC for the intraobserver reproducibility as well (see
Table 3).
Table 3 shows that we even achieve absolute agreement at each time point. Therefore, the variability between two readings of the same reader is negligible compared with the usual variability between different subjects with the same state of disease.
To analyze the data of OCT versus histomorphometric measurements, we constructed a Bland-Altman plot, from which we draw the same conclusions as Huang and colleagues,
1 despite using a different statistical method. Hence, for the sake of clarity for the general reader and because of the limited space in the journal, we decided to include only Pearson's correlation in our manuscript. Note that the histomorphometric analysis showed a smaller retinal nerve fiber layer thickness than the OCT measurements in all eyes in our study. As already mentioned in the manuscript, this observation can be explained by the tissue shrinkage occurring during the fixation procedure. Therefore, the comparison between OCT measurement and histomorphometric analysis can only show consistency, not total agreement. For the sake of completeness, we also determined the consistency version of the ICC for these data. An ICC of 0.726 indicates good agreement, considering the extra variance attributable to tissue shrinkage.
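The difference between consistency and total agreement in such a method comparison can be sketched with the quantities behind a Bland-Altman plot. The thickness values below are hypothetical and are not our data, and the function name is likewise an assumption. A systematic offset, as would be expected from tissue shrinkage, appears as a nonzero bias, while the limits of agreement describe the scatter around that offset.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) of agreement for paired data."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)                 # systematic offset between the methods
    spread = 1.96 * stdev(diffs)       # half-width of the 95% limits of agreement
    return bias, bias - spread, bias + spread


# Hypothetical paired thickness values in micrometres (not study data).
oct_um = [10.0, 12.0, 14.0, 16.0]
histo_um = [8.0, 11.0, 11.0, 14.0]

bias, lower, upper = bland_altman(oct_um, histo_um)
print(bias, lower, upper)
```

In this toy example every difference is positive, so the whole interval between the limits lies above zero. The two methods then track each other consistently while disagreeing in absolute level, which a correlation coefficient alone would not reveal.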