The purpose of this study was to determine the between-scale agreement of grading estimates obtained with the cross-calibrated MC-D, IER, Efron, and VBR bulbar redness grading scales.
Overall, the perceived redness depended on the sample image and the reference scale that was used (Fig. 2; RM ANOVA; P < 0.001). The perceived redness of 6 of the 16 images was significantly different between at least two of the four grading scales (Fig. 3). In general, sample images were perceived differently with the IER or MC-D scale versus the Efron or VBR scale; only image 16, representing the eye that was perceived to be the most red of all images, deviated from this trend. This finding suggests that redness estimates depend on the dynamic range of the scale being used (Table 1), as the scales with a shorter dynamic range (IER and MC-D) generally resulted in higher redness estimates than the scales with a wider dynamic range (Efron and VBR).
Despite these differences between single images, there was close agreement between the grading estimates of all scales (Table 2; Figs. 4 and 5). There were very high levels of linear association for each combination of scales (all Pearson's r ≥ 0.96). The ICC is a measure of the variability of scores between test and retest sessions relative to the overall variability.9,24,25,29 In this particular case, the ICC was used to quantify the reproducibility of grading estimates obtained with different scales. ICC (2,k) was selected because it estimates the agreement between assessments for a random sample of raters that can be generalized to other raters within some population, and thus represents an indicator of the interchangeability of the grading scales.25 Averaged across observers, between-scale ICCs were at least 0.96, indicating very low variability between grading estimates obtained with different scales.
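For context, ICC(2,k), the two-way random-effects form for averaged ratings in the Shrout and Fleiss framework (which the cited methodological literature presumably follows), is estimated from the mean squares of a two-way ANOVA as:

```latex
\mathrm{ICC}(2,k) = \frac{\mathrm{BMS} - \mathrm{EMS}}{\mathrm{BMS} + (\mathrm{JMS} - \mathrm{EMS})/n}
```

where BMS is the between-targets mean square, JMS the between-raters (here, between-scales) mean square, EMS the residual mean square, and n the number of targets (here, sample images).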
The CCC is a specific type of ICC that describes the departure from perfect concordance of repeated measurements, with a CCC of 1 representing identical scores.5,27 There was high concordance between grading estimates for each combination of scales, with CCC levels of at least 0.93.
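For reference, Lin's formulation of the CCC for two sets of measurements x and y combines the Pearson correlation ρ with penalties for differences in means and variances:

```latex
\mathrm{CCC} = \frac{2\rho\,\sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}
```

The CCC reaches 1 only when ρ = 1, μx = μy, and σx = σy, which is why it is sensitive to both location and scale shifts between two grading scales, not just to linear association.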
Figure 4 provides a qualitative representation of this relationship, and shows that there were only slight deviations from perfect concordance (dashed 45° line) for each pair of scales (solid fit line). Closer inspection shows that the higher redness for the MC-D and IER scales compared with the Efron and VBR scales appears to subside with increasing redness, as indicated by the convergence of the solid fit line toward the 45° line of equality. Overall, the highest levels of between-scale ICC and CCC were found for the MC-D and IER scales and for the Efron and VBR scales, while combinations of scales with different dynamic ranges (e.g., Efron with MC-D) resulted in weaker correlations.
In terms of grading units, the variability of grading estimates between any pair of scales was very low, as indicated by the between-scale CORs (Table 2) and LOAs (Fig. 5). The between-scale LOAs show the range of grading estimates that can be expected 95% of the time when two different scales are used; this range is quantified by the COR, a measure of the variability of the grades relative to the mean of the differences (d̄), which indicates whether there is systematic bias in the grading estimates between scales. There was a small but systematic bias toward higher grades for the scales with shorter dynamic range (MC-D and IER), while scales with similar dynamic range showed no such trend (Table 2 and Fig. 5, dashed horizontal line). Overall, the between-scale CORs were small (indicating low variability and good repeatability) and ranged from five (IER vs. VBR) to eight grading units (IER vs. Efron) for the 0 to 100 bulbar redness range. In terms of grading units, the variability of assessments did not appear to depend on the dynamic range of the scales; however, CORs were slightly higher when grading estimates with the pictorial Efron scale were compared with those of the photographic scales. Overall, these findings suggest that there is close agreement between the grading estimates with the newly calibrated scales. In particular, grading scales with similar dynamic range appear to provide closer agreement of grading estimates.
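The bias, LOA, and COR computations described above can be sketched as follows. This is an illustrative example only, not the study's actual analysis code; the grade values are hypothetical, and it assumes a Bland-Altman-style analysis with COR defined as 1.96 times the standard deviation of the between-scale differences.

```python
# Between-scale agreement statistics for paired grading estimates,
# in the style of a Bland-Altman analysis (0 to 100 redness range).
from statistics import mean, stdev

def between_scale_agreement(grades_a, grades_b):
    """Return mean difference (bias), 95% limits of agreement, and COR."""
    diffs = [a - b for a, b in zip(grades_a, grades_b)]
    d_bar = mean(diffs)               # mean difference: systematic bias between scales
    sd = stdev(diffs)                 # SD of the between-scale differences
    cor = 1.96 * sd                   # coefficient of repeatability
    loa = (d_bar - cor, d_bar + cor)  # 95% limits of agreement
    return d_bar, loa, cor

# Hypothetical grades for the same eight images on two scales:
ier = [12, 25, 33, 41, 55, 62, 70, 84]
vbr = [10, 22, 30, 40, 52, 60, 69, 83]
bias, (lo, hi), cor = between_scale_agreement(ier, vbr)
```

A positive `bias` with limits of agreement that exclude zero would indicate a systematic shift toward higher grades on one scale, mirroring the pattern reported here for the shorter-dynamic-range scales.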
There is only one study that has quantitatively compared subjective grades between bulbar redness scales. Efron et al.12 reported that mean bulbar redness grades (across all observers) were approximately 0.6 grading units higher (for a 0 to 4 range) with the IER scale than with the Efron scale for the same set of sample images. Proportionally, this means that grades were approximately 15% higher on average when the IER scale was used, whereas mean redness grades differed by no more than 4% between any pair of the newly calibrated scales (Table 2). In general, CORs are typically used to quantify the variability of grades for test/retest settings with a single scale,5,9,10,12,30 whereas for this study CORs were calculated to estimate the differences of grading estimates between scales. This complicates direct comparison with other studies; however, it allows an estimate of how the variability between scales compares with the test/retest variability that is typically present with subjective grading. In this study, the between-scale CORs (Table 2) were similar to or even smaller than previously reported within-scale test/retest CORs.9,10,12 Therefore, the calibration of the grading scales appears to provide closer agreement between grading estimates than previously reported when different scales were used,12,15 which implies that the newly calibrated grading scales may be used interchangeably. This finding has great potential for application in research settings. In general, if comparisons of grading estimates are required, the use of the same scale is encouraged for every clinician involved; however, this may not always be possible. Wiegleb and Sickenberger31 have reported that different scales are popular in different parts of the world. Therefore, cross-calibration would be of particular benefit to researchers at geographically disparate research centers (e.g., those involved in a multicenter study), who could continue using the reference images to which they are accustomed while assigning cross-calibrated reference grades that provide better agreement between scales than the original scale steps.
In conclusion, the newly calibrated grading scales were capable of producing highly reproducible redness estimates across scales. Redness estimates differed between scales for only some of the sample images, and where differences were found, they appeared to depend on the dynamic range provided by the reference images of the respective grading scale. Redness estimates tended to be higher for the scales with a comparatively short dynamic range (MC-D and IER) than for the scales with wider dynamic ranges (Efron and VBR); scales with similar dynamic ranges showed closer agreement between grading estimates than scales with different dynamic ranges. Overall, there was very high agreement between the grading estimates of all scales, and it appears that using the newly calibrated grading scales may reduce the between-scale variability when subjectively estimating redness. The use of the newly calibrated scales in a more typical grading setting and with more experienced observers seems to be the logical next step to further evaluate this hypothesis.