Abstract
Purpose. To analyze the value of reading center error correction in automated optical coherence tomography (OCT; Stratus; Carl Zeiss Meditec, Inc., Dublin, CA) retinal thickness measurements in eyes with diabetic macular edema (DME).
Methods. OCT scans (n = 6522) obtained in seven Diabetic Retinopathy Clinical Research Network (DRCR.net) studies were analyzed. The reading center evaluated whether the automated center point measurement appeared correct, and when it did not, measured it manually with calipers. Center point standard deviation (SD) as a percentage of thickness, center point thickness, signal strength, and analysis confidence were evaluated for their association with an automated measurement error (manual measurement needed and exceeded 12% of automated thickness). Curves were constructed for each factor by plotting the error rate against the proportion of scans sent to the reading center. The impact of measurement error on interpretation of clinical trial results and statistical power was also assessed.
Results. SD was the best predictor of an automated measurement error. The other three variables did not augment the ability to predict an error using SD alone. Based on SD, an error rate of 5% or less could be achieved by sending only 33% of scans to the reading center (those with an SD ≥ 5%). Correcting automated errors had no appreciable effect on the interpretation of results from a completed randomized trial and had little impact on a trial’s statistical power.
Conclusions. In DME clinical trials, the error involved with using automated Stratus OCT center point measurements is sufficiently small that results are not likely to be affected if scans are not routinely sent to a reading center, provided adequate quality control measures are in place.
Optical coherence tomography (OCT) is a noninvasive method for measuring the thickness of the central retina. It has become a standard tool in the management of patients with diabetic macular edema (DME). The Stratus OCT (Carl Zeiss Meditec, Dublin, CA) provides automated measurements of mean retinal thickness of the center point, central subfield, and each of four inner and four outer subfields (in micrometers), and total macular volume (cubic millimeters). The OCT software identifies the center point as the intersection of six radial lines and reports the center point thickness (CPT) as the average of the six measurements of thickness at the center point. An SD of the six measurements is then computed.
In the DME studies conducted by the Diabetic Retinopathy Clinical Research Network (DRCR.net), OCT scans have been sent to a central reading center for evaluation of quality, morphology, and accuracy of automated thickness measurements. DRCR.net has adopted the central subfield rather than the center point for analysis of central retinal thickness data, because the former is measured with less variability than the latter.1,2 However, only the CPT can easily be measured manually when the automated measurement is incorrect. To limit missing data for scans with incorrect automated measurements, the manual center point measurement can be used to compute a value for the central subfield, since the two correlate highly.1,3
Since most scans in DME have accurate automated measurements, we questioned the cost-effectiveness of sending all scans to a reading center to correct errors in the automated measurements, compared with simply using the automated measurements for analysis of clinical trial data. To answer this question, we used data collected in DRCR.net protocols.
The study included 6522 OCT scans obtained at baseline and follow-up from eyes enrolled in seven DRCR.net studies of DME. OCT operators were certified by the DRCR.net Reading Center at the University of Wisconsin, Madison. Some protocols required that central foveal edema be present in the study eyes, whereas others did not. In each study, Stratus OCT scanning was performed after pupil dilation. Scans were 6 mm in length and included the six-radial-line pattern (128 scan resolution, fast macular scan option with Stratus OCT) for quantitative measures and the crosshair pattern (6–12 and 9–3 o’clock) at 512 scan resolution for qualitative assessment of retinal morphology.1 OCT operators were instructed that if the SD of the CPT was 10% or more of the CPT or if the signal strength was less than 6, the scans should be repeated and submitted only if the OCT technician believed that the scans were of adequate quality or that better-quality scans were not obtainable on repetition.
Scans were sent to the DRCR.net Reading Center for evaluation. Scans with center point measurements graded as inaccurate by evaluators at the reading center were manually remeasured from prints of the retina map scans using digital calipers. The center point was manually measured for all scans with a center point SD ≥ 10.5%, which was computed as the SD of the six automated center point measurements divided by the CPT (average of the six measurements). This cutoff was selected based on the reading center’s prior experience in evaluating a random sample of images in which manual measurement of the center point was necessary in all images with center point SD ≥ 10.5%. For all other scans, if decentration (the center of the scan not aligned with the center of the macula) or boundary line artifacts (the instrument software incorrectly identified the internal and external limits of the retina) were identified, the CPT was manually measured unless poor scan quality precluded a measurement.
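As a concrete illustration, the 10.5% screening rule described above can be sketched in a few lines. This is our own sketch of the computation, not the reading center's software; the function name and the sample values are illustrative.

```python
import statistics

def needs_manual_measurement(radial_um, cutoff_pct=10.5):
    """Flag a scan for manual caliper measurement when the SD of the six
    radial center point thickness values (micrometers) reaches the cutoff,
    expressed as a percentage of the CPT (the mean of the six values)."""
    cpt = statistics.mean(radial_um)                      # center point thickness
    sd_pct = 100.0 * statistics.stdev(radial_um) / cpt    # SD as % of CPT
    return sd_pct >= cutoff_pct

# Six consistent radial measurements: not flagged.
print(needs_manual_measurement([300, 305, 298, 302, 301, 299]))  # False
# One badly segmented radial line inflates the SD: flagged.
print(needs_manual_measurement([300, 305, 150, 302, 301, 299]))  # True
```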
For analysis, each OCT scan was categorized as having a correct or an incorrect center point measurement. The measurement was considered correct if either the reading center determined that a manual measurement was not needed, or a manual measurement was deemed necessary but the automated measurement was within 12% of the manual measurement. Twelve percent was selected because it approximates the 95% confidence interval on an OCT measurement.2 An automated measurement error was defined as an automated measurement that differed from the manual measurement by 12% or more of the automated value. Automated measurements of scans in which poor quality precluded a manual measurement were also considered to be errors.
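The error definition above can be expressed as a small classifier. This is an illustrative sketch (the function name is ours), assuming thicknesses in micrometers and that a scan too poor for manual measurement is passed as None.

```python
def is_automated_error(automated_um, manual_um, manual_needed):
    """Classify a scan per the error definition: an automated measurement
    is an error if a manual measurement was needed and the two differ by
    12% or more of the automated value. manual_um is None when poor scan
    quality precluded a manual measurement (also counted as an error)."""
    if not manual_needed:
        return False                       # automated value accepted as correct
    if manual_um is None:
        return True                        # unmeasurable scans count as errors
    return abs(automated_um - manual_um) >= 0.12 * automated_um

print(is_automated_error(300, 330, True))  # False: 10% difference, within tolerance
print(is_automated_error(300, 340, True))  # True: roughly 13% difference
```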
Four factors were evaluated as possible predictors of an automated center point measurement error: the SD of the center point as a percentage of the CPT (SD), signal strength, the confidence of the analysis as reported by the instrument software, and the automated CPT. All four factors are present on, or can be calculated from, the Stratus OCT printout. The correlation of each factor with the others was assessed with Spearman correlation coefficients. The association of each factor with the center point automated measurement error rate was assessed in univariate and multivariate logistic regression models. A cost–benefit type of analysis was used to evaluate further the relative importance of each factor in detecting automated errors. The cost–benefit analyses were performed by calculating the number of scans in which an incorrect center point measurement was detected (benefit) as a function of the proportion of images sent to the reading center for manual assessment (cost). Parallel analyses using receiver operating characteristic (ROC) curve methods produced similar conclusions (data not shown). Additional cost–benefit curves were constructed for SD stratified by center point thickness, with separate curves generated for CPT < 250 μm, 250 to 299 μm, 300 to 399 μm, 400 to 499 μm, and ≥ 500 μm.
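The construction of a cost–benefit curve can be sketched as follows: scans are ranked from worst to best on a candidate factor, and each possible cutoff yields one (cost, benefit) point. The function name and the data in the example are ours, purely for illustration, not from the study.

```python
def cost_benefit_curve(scans):
    """scans: (factor_value, is_error) pairs, e.g. (center point SD%, bool).
    Returns (cost, benefit) points: the fraction of scans sent to the
    reading center versus the fraction of automated errors caught."""
    ranked = sorted(scans, key=lambda s: s[0], reverse=True)  # worst scans first
    total_errors = sum(1 for _, err in ranked if err)
    points, caught = [], 0
    for i, (_, err) in enumerate(ranked, start=1):
        caught += int(err)
        points.append((i / len(ranked), caught / max(total_errors, 1)))
    return points

# Illustrative data: sending the worst 25% of scans catches half the errors.
curve = cost_benefit_curve([(12.0, True), (8.0, False), (6.0, True), (2.0, False)])
print(curve[0])  # (0.25, 0.5)
```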
To evaluate the impact of reading center review of the center point automated measurements on a clinical trial, we used results from a DRCR.net protocol comparing two laser techniques for DME: the results obtained with reading center evaluation of all scans were compared with the results that would have been obtained using only the automated measurements.4 For this analysis, change in CPT from baseline to 12 months was compared between treatment groups using repeated-measures least-squares regression models adjusted for baseline thickness and accounting for the correlated data from subjects with two study eyes. In parallel fashion, the treatment groups were compared for change in central subfield thickness, with and without replacing invalid automated central subfield values with values imputed from the manually measured CPT.1
The impact on a trial’s sample size of using solely the automated center point measurements, versus having a reading center evaluate the scans, was assessed using the fact that sample size is proportional to the overall variance. The overall variance was calculated as the between-subject variation plus the measurement error. The percentage reduction in variance (and therefore in sample size) was calculated assuming that manual measurement would give the correct thickness value (i.e., zero measurement error for scans sent to the reading center). The sample size for a hypothetical trial with 90% power and a two-sided type I error of 5%, assuming an effect size of 50 μm with an SD of 150 μm, was then computed, varying the number of images sent to the reading center. Statistical analyses were performed with commercial software (SAS, ver. 9.1; SAS Institute, Cary, NC).
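The hypothetical sample size above can be reproduced with the standard normal-approximation formula for a two-group comparison of means. This is our own sketch under those stated assumptions (not the authors' SAS code); rounding up to a whole subject per group is a conventional choice.

```python
import math
from statistics import NormalDist

def per_group_n(effect, sd, alpha=0.05, power=0.90):
    """Per-group sample size for a two-group comparison of means:
    n = 2 * (z_{alpha/2} + z_beta)^2 * (sd / effect)^2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for two-sided alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # 1.28 for 90% power
    return math.ceil(2 * (z_a + z_b) ** 2 * (sd / effect) ** 2)

# Effect size 50 um, SD 150 um, 90% power, two-sided type I error of 5%.
print(per_group_n(50, 150))  # 190
```

Because the required n scales with the variance, a 6% inflation of the overall variance from uncorrected measurement error translates directly into roughly 6% more subjects.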
In this study, we have shown that, in a large randomized trial, there is little benefit derived from sending all scans to a reading center to evaluate whether the automated measurement of CPT is correct in eyes with DME. When all scans were sent, approximately 86% had accurate automated measurements. Manually measuring the CPT for the other 14% enhanced overall accuracy, but only slightly: compared with having all scans reviewed by a reading center, the mean change in CPT differed by only 5 μm and the mean change in central subfield thickness by only 4 μm. Similar results were obtained using retinal volume as the outcome measure.
The SD of the center point is more predictive of the probability of an error in the automated measurements than is signal strength or the confidence assessment. Both of the latter are provided by the Stratus OCT software as measures of scan quality, but neither added much to the center point SD. Although automated errors are more likely in eyes with center-involved DME when the fovea is thinner, retinal thickness also added little to the predictive ability of the center point SD alone. If there is a desire to reduce the error rate from the expected 14%, a portion of the scans could be sent to the reading center. Our results indicate that the center point SD is the best measure to use in this regard: an error rate of 5% can be achieved by sending to a reading center only those scans with a center point SD of 5% or more. However, this has no appreciable impact on the error present in the mean change in CPT.
Evaluation of OCT scan quality by the clinical site OCT technicians and investigators would also increase data accuracy. The SD of the center point may still be less than 5% even if the scan has decentration and/or boundary line errors. By repeating erroneous scans until a scan of satisfactory quality is obtained for submission, operator-error issues can be corrected. Specific training in OCT quality assessment is required for optimal clinical site performance. Despite this caveat, it is unlikely that the frequency of erroneous scans could have been reduced significantly in this study, considering that in the DRCR.net protocols, OCT technicians were instructed to repeat the OCT measurement if the SD was ≥10% or the signal strength was less than 6 (i.e., images of potentially poor quality).
There is a tradeoff for a clinical trial in reducing, or eliminating altogether, the scans sent to a reading center. Sending scans to the reading center and correcting errors in the center point measurement will reduce the required sample size for a given statistical power. Our data suggest that the increase in sample size is likely to be small, approximately 6%, in a study of center-involved DME powered to assess treatment group differences in the change in central subfield thickness, if no scans were sent to a reading center compared with having all scans sent. In large trials, however, even a 6% increase in sample size could lead to a significant increase in study cost. Therefore, in planning a trial, the cost per additional subject recruited must be weighed against the cost of having OCT scans graded by a reading center to determine the most cost-effective strategy for the trial.
When the retina is thicker, the likelihood of a measurement error decreases, and thus the yield of sending scans to a reading center for evaluation is lower. This is probably due to a detection bias: a decentered scan is more likely to be identified when the retina has a morphologically distinct center (such as a foveal depression), which is more often observed in a thinner retina than in a thickened retina with disorganized morphology. In determining whether to send scans to a reading center for a protocol, there may be less reason to do so in a study in which eligibility requires substantial central retinal thickening than in a protocol in which central thickening is required to be absent and the objective of the intervention is the prevention of DME. In the latter circumstance, the misclassification of outcomes due to automated errors could necessitate that all scans be evaluated by a reading center.
DRCR.net is using the results of this study to determine, for each protocol, which scans should be sent to the reading center for evaluation and, when indicated, manual measurement of the center point. For protocols in which OCT-measured retinal thickness is not the primary outcome measure and morphology grading is not required, in general either no scans are sent to the reading center, or only scans exceeding a specified center point SD value or falling below a specified CPT are sent.
It is important to recognize that these results apply to cases of DME and not to age-related macular degeneration or other macular conditions in which the automated error rate can be much higher.5–8 In addition, these results apply to the software available with the Stratus OCT (Carl Zeiss Meditec, Inc.) and not necessarily to the software algorithms of other OCT devices. Nevertheless, the principles followed in this analysis may be generalized to other devices and conditions in determining the most efficient strategy for having a reading center evaluate retinal thickening, whether by OCT or another technology, in a clinical trial. Finally, frequency domain OCT with registration to fundus landmarks is expected to reduce decentration errors, and better retina sectioning algorithms are expected to reduce boundary line errors in the near future. It appears that, for Stratus OCT, manual grading adds little measurement accuracy to studies of DME; manual grading of thickness measurements on frequency domain OCT images in future clinical trials may be even less important. Reading centers provide a mechanism for masked assessment of morphology and will remain important for quality control in clinical trials employing only the numeric data from the OCT output. With the common scarcity of resources and the need to reallocate available support to other important aspects of clinical trials, using reading center analyses where necessary, and avoiding them where their impact is minor, will permit cost savings without substantially sacrificing data accuracy.
A current list of the Diabetic Retinopathy Clinical Research Network is available at http://www.drcr.net.
Supported through a cooperative agreement from the National Eye Institute and the National Institute of Diabetes and Digestive and Kidney Diseases (EY14231, EY14269, EY14229).
Submitted for publication February 14, 2008; revised May 8, 2008; accepted November 20, 2008.
Disclosure: A.R. Glassman, None; R.W. Beck, None; D.J. Browning, None; R.P. Danis, None; C. Kollman, None
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be marked “advertisement” in accordance with 18 U.S.C. §1734 solely to indicate this fact.
No reprints will be available.
Corresponding author: Adam R. Glassman, c/o Jaeb Center for Health Research, 15310 Amberly Drive, Suite 350, Tampa, FL 33647; [email protected].
Table 1. Frequency of Manual Grading for Factors Potentially Predictive of Center Point Errors*
Factor | n | Error Rate, n (%)* |
Overall | 6522 | 881 (14) |
Standard deviation of center point | | |
0% | 67 | 6 (9) |
>0%–1% | 783 | 32 (4) |
>1%–2% | 1130 | 55 (5) |
>2%–3% | 1049 | 84 (8) |
>3%–4% | 855 | 69 (8) |
>4%–5% | 655 | 67 (10) |
>5%–6% | 482 | 45 (9) |
>6%–7% | 343 | 55 (16) |
>7%–8% | 252 | 44 (17) |
>8%–9% | 197 | 33 (17) |
>9%–10% | 104 | 22 (21) |
>10% | 605 | 369 (61) |
Analysis confidence low | | |
Absent | 5384 | 567 (11) |
Present | 1138 | 314 (28) |
Signal strength | | |
0 | 24 | 16 (67) |
1 | 56 | 17 (30) |
2 | 123 | 31 (25) |
3 | 308 | 53 (17) |
4 | 714 | 115 (16) |
5 | 1268 | 174 (14) |
6 | 1577 | 206 (13) |
7 | 1260 | 151 (12) |
8 | 713 | 84 (12) |
9 | 324 | 28 (9) |
10 | 155 | 6 (4) |
Center point thickness (μm) | | |
(A) <225 | 2363 | 312 (13) |
(B) 225–249 | 502 | 115 (23) |
(C) 250–299 | 871 | 199 (23) |
(D) 300–399 | 1225 | 154 (13) |
(E) 400–499 | 742 | 55 (7) |
(F) 500–599 | 464 | 19 (4) |
(G) 600–699 | 237 | 16 (7) |
(H) ≥700 | 118 | 11 (9) |
Table 2. Potentially Predictive Factors of the Need for Manual Grade in Logistic Regression Models
Factor | Univariate Model | | | Multivariate Model | | |
| Estimate, OR (95% CI) | P | R² | Estimate, OR (95% CI) | P, Final Model | Cumulative R²* |
Ratio of SD to CPT (per 1% increase) | 1.20 (1.18–1.22) | 0.001 | 14% | 1.18 (1.17–1.20) | 0.001 | 14% |
OCT CPT (per 100 μm increase) | 0.82 (0.78–0.87) | 0.001 | 0.8% | 0.80 (0.74–0.85) | 0.001 | 15% |
Signal strength (per 1 unit increase) | 0.85 (0.81–0.88) | 0.001 | 1% | 0.93 (0.89–0.98) | 0.004 | 15% |
Analysis confidence (low vs. not low) | 3.24 (2.77–3.79) | 0.001 | 3% | 1.17 (0.93–1.47) | 0.17 | 15% |
Table 3. Comparison of Thickness Data from a DRCR.net Randomized Trial
| Baseline | | 1 Year | | Change | | P |
| Modified-ETDRS | MMG | Modified-ETDRS | MMG | Modified-ETDRS | MMG | |
Center point measurement | | | | | | | |
Automated measurements* | 319 ± 149 | 322 ± 129 | 257 ± 97 | 287 ± 135 | −63 ± 144 | −41 ± 140 | 0.04 |
Corrected dataset† | 325 ± 156 | 329 ± 133 | 254 ± 100 | 294 ± 153 | −71 ± 150 | −37 ± 141 | 0.01 |
Central subfield | | | | | | | |
Automated measurements* | 336 ± 131 | 339 ± 111 | 273 ± 73 | 304 ± 115 | −60 ± 128 | −40 ± 118 | 0.02 |
Corrected dataset† | 340 ± 134 | 343 ± 111 | 274 ± 80 | 308 ± 129 | −66 ± 129 | −37 ± 118 | 0.01 |