Reaching a consensus on the interchangeability and utility (i.e., for disease detection/monitoring) of a medical device is the eventual aim of repeatability and agreement studies. The tolerance and relative utility indices described in this report provide a methodology for comparing change in clinical measurement noise between different populations (repeatability) or measurement methods (agreement), so as to highlight problematic areas. No longitudinal data are required to calculate these indices. Both indices establish a metric, from least to most affected, across all parameters to facilitate comparison. If validated, these indices may prove useful tools when combining reports and forming the consensus required in the validation process for software updates and new medical devices.

Reaching such a consensus is difficult for two reasons: first, […],^{1,2} and second, several different methods of statistical analysis are reported.^{3–6} (Typical methods are the mean intrasubject SD, which is scale dependent, although confidence limits can be used to indicate significant differences; the coefficient of variation [CoV], which is a percentage but reliant on the underlying measurement scale, e.g., a CoV of 3% means something very different for a central macular thickness of 350 μm than for a maximal corneal curvature of 45 diopters [D]; and intraclass correlation coefficients, which suffer from the same problems as CoV [see the review by McAlinden et al.^{3}].) Due to differences between study populations, results from different statistical methods cannot be transformed and combined post hoc; therefore, a consensus is rarely reached on device interchangeability or precision.^{3}

Depending on the size of the change in repeatability, a much smaller sample size may be sufficient to demonstrate a significant difference. For example, the significantly worse repeatability of maximal corneal curvature in keratoconus eyes than in healthy eyes with the Pentacam device (Oculus Optikgeräte GmbH, Wetzlar, Germany) was recently highlighted,^{1,4} and this result had clinical impact with respect to progression detection and the choice of treatment with collagen cross-linking.^{1}

The many different measurement scales (e.g., μm, D) hindered comparison of repeatability across the 20+ parameters used clinically to monitor keratoconus.^{1} This difficulty motivated us to create a single index that is scale independent and that indicates at a glance whether the repeatability of any parameter in pathological eyes is significantly different from that in healthy eyes. We have named this the "tolerance index," and the methodology to calculate it is given within this report. The tolerance index also may be useful in interdevice agreement studies, as these studies assess the measurement noise due to a change of device, which includes the intradevice noise (repeatability). The question of "agreement" can be reduced to the following: are the limits of agreement (interdevice noise) significantly wider than the limits of repeatability (intradevice noise)? In this case, the methodology is similar to that described above and the tolerance index is applicable. Worked examples of the tolerance index for both agreement and repeatability are given within this article.

The report by Hashemi et al.^{1} raised the question: what is the effect of these changes in measurement noise on the utility of the parameter? To answer this question, we have created a measure of the utility of a parameter, named the "relative utility index" (RU). The RU is a ratio of the range of each parameter (e.g., maximal corneal curvature [K-max] ranges from 45 D to 65 D) to the measurement noise, so as the noise (repeatability limits) increases, the RU decreases. In other words, the path from health to advanced disease can be partitioned into steps whose size depends on the amount of measurement noise; the RU is therefore an estimate of the number of distinguishable stages of disease severity. Theoretically, the more stages that are distinguishable, the better the parameter is at detecting progression. Alongside the tolerance index described above, the details of how to calculate the RU index are given within this report, with appropriate worked examples.

Repeatability (*Sr*) and reproducibility (*SR*) were calculated as per the recommendations from the British Standards Institute and the International Organization for Standardization.^{5,6} *Sr* equals the within-subject SD for repeated measures by the same observer, derived by a one-way ANOVA. The repeatability limit (*r*) is reported as *r* = 1.96 × √2 × *Sr*, which gives the likely limits within which 95% of repeated measurements should occur. For each subgroup, *Sr* and *r* were calculated; the *SR* and reproducibility limits (*R*) were likewise calculated. We wish to examine whether there are significant differences in repeatability and reproducibility limits between healthy and pathological subgroups.
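As a sketch of these calculations (not the authors' code): *Sr* can be obtained by pooling the within-subject variances, which is the residual of a one-way ANOVA with subjects as groups, and the repeatability limit follows as 1.96 × √2 × *Sr*. The data below are invented for illustration.

```python
import math

def repeatability(measurements):
    """Sr = within-subject SD from a one-way ANOVA (subjects as groups,
    equal repeats per subject); r = 1.96 * sqrt(2) * Sr, the repeatability limit."""
    n = len(measurements)                 # subjects
    k = len(measurements[0])              # repeats per subject
    # within-subject sum of squares, pooled over subjects
    ss_within = sum(
        sum((x - sum(row) / k) ** 2 for x in row) for row in measurements
    )
    sr = math.sqrt(ss_within / (n * (k - 1)))   # square root of the residual mean square
    return sr, 1.96 * math.sqrt(2) * sr

# hypothetical CCT readings (μm), two repeats per eye
sr, r = repeatability([(540, 544), (512, 509), (498, 503), (530, 528), (551, 555)])
print(round(sr, 2), round(r, 2))  # Sr ≈ 2.65 μm, r ≈ 7.33 μm
```

With balanced data, this pooled within-subject variance equals the one-way ANOVA residual mean square, so a full ANOVA routine is not needed for the sketch.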

In agreement studies, the limits of agreement (*LoA*s) have been calculated using the Bland-Altman method. Generally, there is some degree of systematic bias that can be corrected for, after which the width of the confidence limits can be estimated; these are referred to as the *LoA*. The width of the *LoA* between devices is a combination of the intradevice noise (repeatability) and the interdevice noise. The important question is: are the *LoA*s significantly wider than the limits of repeatability? If so, a change of device significantly increases the noise and decreases its utility; in short, the devices are not interchangeable.

The repeatability and reproducibility limits derived from a healthy cohort (denoted *r*_{H} and *R*_{H}), and the repeatability and reproducibility limits derived from a pathological cohort (denoted *r*_{P} and *R*_{P}), are used to calculate the tolerance index, denoted *Tr* and *TR* for repeatability and reproducibility, respectively. This is computed as the natural log (Log_{n}) of the ratio between *R*_{P} and *R*_{H} for reproducibility limits, and as Log_{n} of the ratio between *r*_{P} and *r*_{H} for repeatability limits (Equation 1), where *i* corresponds to the parameter; for example, IOP, CCT (central corneal thickness), or CMT (central macular thickness):

*Tr*_{i} = Log_{n}(*r*_{P,i}/*r*_{H,i}); *TR*_{i} = Log_{n}(*R*_{P,i}/*R*_{H,i}). (Equation 1)

In agreement studies, the limits of agreement (*LoA*) and the limits of repeatability (*r*) derived from the same population are used to calculate the tolerance index for agreement, denoted *TA*. This is computed as Log_{n} of the ratio between the *LoA* and *r* (Equation 2):

*TA*_{i} = Log_{n}(*LoA*_{i}/*r*_{i}). (Equation 2)
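Equations 1 and 2 both reduce to a natural-log ratio of two noise limits, so a minimal sketch suffices; the numeric values below come from the worked examples later in this report.

```python
import math

def tolerance(limit_numerator, limit_denominator):
    """Tolerance index: natural log of the ratio of two noise limits
    (r_P/r_H or R_P/R_H for Equation 1, LoA/r for Equation 2)."""
    return math.log(limit_numerator / limit_denominator)

# TR for K-max reproducibility: 2.3 D in keratoconus vs 0.8 D in healthy eyes
print(round(tolerance(2.3, 0.8), 2))    # 1.06
# TA for pachymetry agreement: LoA 13.9 μm vs Galilei repeatability 10.9 μm
print(round(tolerance(13.9, 10.9), 2))  # 0.24
```

The resulting values are scale free: a tolerance of 0 means identical noise, and larger values mean proportionally wider limits in the numerator cohort or method.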

Whether a tolerance value indicates a significant difference depends on the uncertainty of the two underlying estimates, whose confidence limits have relative widths *P*_{1} and *P*_{2}. If we do not wish for the confidence limits to overlap, then the ratio between the two limits must be greater than the sum of the following: 1 (= perfect agreement) + *P*_{1} + *P*_{2}. The tolerance limit will correspond to the Log_{n} of this sum. In Table 1, the explicit cutoffs for a range of sample sizes are given. Clearly, as *n* increases, both *P*_{1} and *P*_{2} will decrease, so the tolerance limits will approach 0. In practice, a tolerance limit of 0.45 will be greater than all tolerance limits for sample sizes of 15 or more.
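The cutoff rule can be sketched directly. The relative confidence-limit widths *P*_{1} and *P*_{2} depend on sample size and are tabulated in Table 1 of the report; the values used below are purely illustrative, not taken from that table.

```python
import math

def tolerance_cutoff(p1, p2):
    """Cutoff = Log_n(1 + P1 + P2), where P1 and P2 are the relative widths
    of the confidence limits of the two noise estimates (sample-size
    dependent; see Table 1 — the values passed here are illustrative)."""
    return math.log(1 + p1 + p2)

# with illustrative relative widths of 0.12 and 0.15:
print(round(tolerance_cutoff(0.12, 0.15), 3))  # 0.239
# the cutoff shrinks toward 0 as samples grow (P1, P2 -> 0):
print(round(tolerance_cutoff(0.02, 0.03), 3))  # 0.049
```

A computed tolerance index is then declared significant when it exceeds this cutoff, mirroring the non-overlap argument in the text.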

**Table 1**

To calculate the *RU*, we make the assumption that all differences observed between any two measures are due to the instrument, including misalignment errors (within-subject SD; *Sr*_{i}), the observer (between-observer SD; *SR*_{i}), or structural differences between patients (between-subject SD; *SP*_{i}). The relative utility (*RU*_{i}) is the ratio between the signal (*SP*_{i}) and the total variability (*TV*_{i}, the combination of these sources of variability). The *RU* scale goes from 0 to 1, with low potential nearer 0 and high potential nearer to 1.

The *RU* index could be expanded to examine the effect of a change of device (between-device SD, *SDev*_{i}). Mathematically, this introduces a fourth parameter into *TV*_{i}. Comparing the intradevice *RU* index versus the interdevice *RU* index could further help identify where instruments are not interchangeable.

The following worked examples illustrate the calculation of the tolerance and *RU* indices.

The first worked example concerns the reproducibility of *K-max* with the Pentacam device. This was based on the limits of reproducibility (0.8 D) established in 100 healthy eyes, as reported by McAlinden et al.^{4} A report on the limits of reproducibility in 32 keratoconus eyes demonstrated a clear difference in repeatability of *K-max* (2.3 D).^{1} In this study by Hashemi et al.,^{1} three images were taken by two observers. The tolerance can be calculated on the reproducibility limits of a single image taken by each observer (*TR1*), the average of pairs of images taken by each observer (*TR2*), and the average of triplets of images taken by each observer (*TR3*). The tolerance index for reproducibility of *K-max*, *TR1*_{K-max}, is Log_{n}(2.3/0.8) = 1.1; the *TR1* cutoff is 0.24 (N1 = 100, N2 = 32; see Table 1). Because in this example *TR1*_{K-max} > TR cutoff, the difference is statistically significant. The traditional reproducibility limits are given in the column labeled "*R*" in Table 2, each with a scale attached (mm^{3}, D, deg, or μm). In keratoconus, up to 20 parameters are monitored by the clinician, yet no comparative interpretation can be made without background knowledge of normative limits. Tolerance (column labeled *TR1*) provides an easier way to note where additional care is required, as values outside normal limits (greater than the TR cutoff) are evident. Those significantly better than normal are marked in italic, those significantly worse in bold, and those similar to normal in plain font (Table 2, TR values). *K-max* is clearly outside normal limits, but so also are pachymetry at the thinnest corneal thickness (TCT) and apex, corneal power in diopters (KPD), anterior chamber (AC) volume, and AC depth, with *K-max* and the pachymetry measures being the most affected.

**Table 2**

Comparison of the *TR1*, *TR2*, and *TR3* values showed that taking the average of triplets of images in keratoconus patients reduces the variability in *K-max* to be in line with that observed in healthy eyes (*TR1* = 1.06 > 0.24, *outside* the CI of normal limits; *TR2* = 0.50 > 0.24, *outside* the CI of normal limits; *TR3* = 0.12 < 0.24, *inside* the CI of normal limits; Table 2). However, the reproducibility of thinnest corneal thickness remains outside normative limits despite this averaging of images (Table 2). This indicates that the reproducibility of TCT may be more affected by the presence of pathology than that of *K-max*.

The *RU* values for each parameter are given in Table 2. Looking in detail, there is a narrow range of CCT values within the cohort (mean CCT = 497 μm, interquartile range 470–518, SD = 36), whereas the reproducibility limits are large in comparison (*r* = 12.3 μm, *R* = 29.5 μm). The range (an estimate of the interval containing 95% of all data) is 141 μm wide, and because *R* is 29.5 μm, there are four (141 μm/29.5 μm) distinguishable stages of disease severity measurable with this parameter; with *K-max* in the same scenario, there are six distinguishable stages of disease severity. When the average of three images is used, the *R* of CCT reduces to 15.4 μm and there are 9 distinguishable stages of disease severity (*RU* = 0.42); with *K-max* in the same scenario, there are 16 distinguishable stages of disease severity (*RU* = 0.88). This indicates that pachymetry may not be as good a parameter for monitoring keratoconus progression with the Pentacam device as *K-max*. The *RU* value summarizes this calculation for each parameter, giving a metric ranking parameters from least to most likely to be useful.
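The stage count used in this example follows directly from the two numbers in the text: the 95% population range (1.96 × 2 × SD) divided by the noise limit *R*. A short sketch, replaying the CCT figures above:

```python
import math

def distinguishable_stages(between_subject_sd, noise_limit):
    """Approximate number of disease-severity steps a parameter can resolve:
    the 95% population range (1.96 * 2 * SD) divided by the noise limit R."""
    population_range = 1.96 * 2 * between_subject_sd
    return math.floor(population_range / noise_limit)

# CCT: SD = 36 μm; R = 29.5 μm (single image) vs 15.4 μm (triplet average)
print(distinguishable_stages(36, 29.5))  # 4
print(distinguishable_stages(36, 15.4))  # 9
```

Averaging repeated images shrinks *R* but leaves the population range unchanged, which is why the stage count (and with it the *RU*) improves.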

For CMT in healthy eyes, a repeatability of *r* = 2 μm has been reported^{7} in a cohort of 41 eyes. Subsequently, Fiore et al.^{8} reported a CoV of 3.1% for CMT in 21 eyes of patients with diabetic macular edema (DME), whereas Comyn et al.^{9} reported an *r* of 8 μm for CMT in 50 eyes of DME patients, corresponding to a CoV of 1.6%. The *Tr* values between healthy and pathological eyes were Log_{n}(3.1/0.5) = 1.8 for the group reported by Fiore et al.^{8} and Log_{n}(1.6/0.5) = 1.2 for the group reported by Comyn et al.^{9} These *Tr* values are greater than the respective *Tr* cutoffs of 0.32 and 0.26 (see Table 1). This indicates that the repeatability is significantly worse in eyes with DME.

Comparing the two DME cohorts, the *TR* value is 0.66 (= Log_{n}[3.1/1.6]). Because the sample sizes are 21 and 50, respectively, the corresponding *TR* cutoff is 0.31. Therefore, a significant change in repeatability between cohorts with the same pathology is noted (this may be due to differences in disease severity between studies or to different segmentation algorithms applied).
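Because the tolerance index is a natural-log ratio of two noise estimates, scale-free CoV values can be compared directly. Replaying the CMT values from the text (0.5% healthy, 3.1% and 1.6% in DME) against the Table 1 cutoffs quoted above:

```python
import math

# CoV values from the text: 0.5% (healthy), 3.1% (Fiore et al.), 1.6% (Comyn et al.)
tr_fiore = math.log(3.1 / 0.5)    # vs healthy
tr_comyn = math.log(1.6 / 0.5)    # vs healthy
tr_between = math.log(3.1 / 1.6)  # between the two DME cohorts
print(round(tr_fiore, 1), round(tr_comyn, 1), round(tr_between, 2))  # 1.8 1.2 0.66

# compare each against its sample-size-dependent cutoff from Table 1
for value, cutoff in [(tr_fiore, 0.32), (tr_comyn, 0.26), (tr_between, 0.31)]:
    print(value > cutoff)  # True in each case: significantly different
```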

The *RU* can be calculated for the cohort of patients with DME reported by Fiore et al.^{8} Repeatability is better in the macular segments A1 to A9 (as labeled in the Spectralis device) than in the foveolar estimates. This corresponds to higher *RU* values for the macular segments A1 to A9 (>0.95) than for the central foveolar measure (*RU* = 0.76). This suggests that the foveolar estimate may be a less appropriate measure to monitor change in these patients than the macular segments.

One study^{10} examined agreement of the Galilei device (Ziemer Ophthalmic Systems AG, Port, Switzerland) and ultrasound in pachymetry measures of healthy eyes (*n* = 77) and eyes that had undergone Lasik surgery (*n* = 39). Repeatability with the Galilei device was 10.9 μm and 13.3 μm in the healthy and Lasik groups, respectively, whereas with ultrasound the limits of repeatability were tighter: 9.8 μm and 10.3 μm in the healthy and Lasik groups, respectively. There was no significant difference detectable in repeatability between Lasik and healthy eyes, with *Tr* values of 0.20 for the Galilei device and 0.05 for ultrasound (the *Tr* cutoff was 0.24).

The *LoA* is 13.9 μm and 19.4 μm for the healthy and Lasik eyes, respectively.^{12} In this case, when we compare the two devices, we are asking whether the *LoA* is significantly wider than the limits of repeatability (*r*). In the healthy eyes, the *TA* cutoff is 0.20 (Table 1; sample 1, healthy eyes with the Galilei; sample 2, healthy eyes with ultrasound) and the *LoA* between ultrasound and Galilei was 13.9 μm. The limit of repeatability with the Galilei is 10.9 μm; therefore, *TA*_{G} = Log_{n}(13.9/10.9) = 0.24. This is greater than the *TA* cutoff, indicating that replacing the Galilei with the ultrasound significantly increases measurement noise. The limit of repeatability with ultrasound is 9.8 μm; therefore, *TA*_{U} = Log_{n}(13.9/9.8) = 0.34. This is also greater than the *TA* cutoff, indicating that replacing ultrasound with the Galilei significantly increases measurement noise. This means that the devices are not interchangeable in healthy eyes without significantly increasing measurement noise. Likewise, in the Lasik eyes the *TA* cutoff is 0.36 and the *LoA* is 19.4 μm; therefore *TA*_{U/G} > 0.63, indicating again that the devices are not interchangeable without significantly increasing measurement noise.
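The interchangeability argument in this example is symmetric: compute *TA* against each device's own repeatability and require both to stay within the cutoff. A sketch using the Lasik-eye figures from the text:

```python
import math

def interchangeable(loa, r_device_a, r_device_b, ta_cutoff):
    """Devices are interchangeable only if the LoA is not significantly wider
    than either device's repeatability limit (both TA values <= cutoff)."""
    ta_a = math.log(loa / r_device_a)
    ta_b = math.log(loa / r_device_b)
    return ta_a <= ta_cutoff and ta_b <= ta_cutoff, round(ta_a, 2), round(ta_b, 2)

# Lasik eyes: LoA = 19.4 μm, Galilei r = 13.3 μm, ultrasound r = 10.3 μm, cutoff 0.36
print(interchangeable(19.4, 13.3, 10.3, 0.36))  # (False, 0.38, 0.63)
```

Both *TA* values exceed the 0.36 cutoff, reproducing the conclusion above that the two devices cannot be swapped without significantly increasing measurement noise.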

The worked examples of the *tolerance index* highlighted those parameters with the greatest and least divergence in precision/agreement across the spectrum of studied parameters. This metric may help simplify repeatability, reproducibility, and agreement studies. Likewise, the worked examples of the *RU index* examined the trade-off between signal and measurement noise by theoretically estimating the number of distinguishable stages in a given parameter. This may better identify those parameters with the highest potential to detect disease progression, *without using longitudinal data*. Both indices return scale-independent values and facilitate comparison between parameters, which may aid the clinician in implementing research findings clinically.

The *Tr* (examples 2a, 3a) and *TR* (examples 1a, 1b, 2b) values facilitate comparison across the spectrum of parameters to highlight where variability is *least* and *most* affected by the presence of pathology. The *TA* (example 3b) value indicates whether the same ocular parameter is similarly affected in two different devices. These indices may simplify the argument required to form a consensus on a given parameter/device in different pathologies. For example, the addition of the *TR* indices (e.g., Table 2) would support the following statement: "comparing reproducibility in keratoconus with healthy eyes using a single image taken with the Pentacam device, *K-max* is one of the worst affected parameters (example 1a); however, if the average of three estimates is used instead, then reproducibility of *K-max* returns to the levels observed in normal eyes, while reproducibility of TCT does not (example 1b)." In this way, tolerance could be used to demonstrate the advantages/disadvantages of a software update or new device by highlighting significant changes.

The *RU* index described here gauges the inherent ability of a parameter to detect change, based on an estimate of the number of "distinguishable" stages of disease severity. The relative utility index can be calculated in any cohort of patients, given the repeatability limits (*r*) and the typical range, as described in examples 1c and 2c. Returning to the example of keratoconus, the correspondence between the RU values of example 1c and the "best-performing" parameters in the literature, in terms of diagnostic accuracy and detecting disease progression, was examined. In several studies examining diagnostic accuracy,^{12–19} it was observed that pachymetry at the center and thinnest location had good sensitivity and specificity; however, the area under the curve (*AUROC*) is lower than that reported with asymmetry indices.^{13,14} Comparing those parameters with *AUROC* values >0.90 reported by Correia et al.^{11} with those parameters with RU >0.95 in example 1c, agreement in many parameters was observed (ISV, IVA, IHD, and ectasia map indices D and Db; data not shown). Likewise, the poorest AUROC results (<0.85) reported by Uçakhan et al.^{12} corresponded well with the poorest *RU* values reported in example 1c (<0.8). In terms of detection of disease progression, *K* values have been shown to be useful in distinguishing between disease stages,^{12} and in the report by Choi et al.,^{19} progressing eyes had a significantly different rate of change in this parameter than nonprogressing eyes, which corresponds well with the high *RU* values in example 1c (*K-max*, *K1*, *K2*, and *Km*; data not shown). Despite CCT and TCT being well established clinically and both demonstrating significant differences in mean values for different stages of disease,^{18} Choi et al.^{19} reported that the annual change rates were not significantly different between progressing and stable eyes for these parameters, which corresponds with the poor RU values for pachymetry in example 1c (*RU* <0.75). The high correspondence between *RU* values and "best-performing" parameters in the literature indicates the possible merit of the RU index as an indicator of the "best-performing" parameters before longitudinal data are available.

The tolerance and *RU* indices aim to provide tools to help researchers and clinicians get to grips with the within- and between-device differences in noise in healthy and pathological eyes. These indices are designed to provide an "at a glance" guide to changes in measurement variability. The tolerance index may help clinicians distinguish real change from variability more astutely, and the RU index may indicate which parameter is best poised to detect disease progression with new devices for which no longitudinal data are available. However, these tools require investigation by other research groups before their merit can be accurately evaluated. If validated by other researchers, the tolerance and *RU* indices have many potential applications in ophthalmology.

**C. Bergin**, None; **I. Guber**, None; **K. Hashemi**, None; **F. Majo**, None

*Ophthalmology*. 2015; 122: 211–212.

*J Glaucoma*. 2000; 9: 247–253.

*Ophthalmic Physiol Opt*. 2011; 31: 330–338.

*Invest Ophthalmol Vis Sci*. 2011; 52: 7731–7737.

*Am J Ophthalmol*. 2009; 147: 467–472.

*Curr Eye Res*. 2013; 38: 674–679.

*Invest Ophthalmol Vis Sci*. 2012; 53: 7754–7759.

*Graefes Arch Clin Exp Ophthalmol*. 2013; 251: 1855–1860.

*International Journal of Keratoconus and Ectatic Corneal Diseases*. 2012; 1: 92–99.

*J Cataract Refract Surg*. 2011; 37: 1116–1124.

*Am J Ophthalmol*. 2014; 158: 32–40.

*Ophthalmology*. 2008; 115: 1534–1539.

*Cont Lens Anterior Eye*. 2014; 37: 26–30.

*J Cataract Refract Surg*. 2006; 32: 1281–1287.

*J Cataract Refract Surg*. 2011; 37: 1282–1290.

*Cornea*. 2012; 31: 253–258.

*Invest Ophthalmol Vis Sci*. 2012; 53: 927–935.