Abstract
Purpose. An automated system for the measurement of microaneurysm (MA) turnover was developed and compared with manual measurement. The system analyzes serial fluorescein angiogram (FA) or red-free (RF) fundus images; fluorescein angiography was used in this study because it is the more sensitive test for MAs. Previous studies have shown that the absolute number of MAs observed does not reflect the dynamic temporal nature of the MA population. In this study, almost half of the MAs present at baseline had regressed after a year and been replaced by new lesions elsewhere.
Methods. Two clinical datasets were used to evaluate the performance of the automated turnover measurement system. The first consisted of 10 patients who had two fluorescein angiograms acquired a year apart. These data were analyzed, both manually and using the automated system, to investigate the inter- and intraobserver variations associated with manual measurement and to assess the performance of the automated system. The second dataset contained FAs from a further 25 patients. This dataset was analyzed only with the automated system to investigate some properties of microaneurysm turnover, in particular the differing detection sensitivities of new, static, and regressed microaneurysms.
Results. Manual measurements exhibited large inter- and intraobserver variation. The sensitivity and specificity of the automated system were similar to those of the human observers. However, the automated measurements were more consistent—an important condition for accurate turnover quantification. Regressed MAs were more difficult to detect reliably than new MAs, which were themselves more difficult to detect reliably than static MAs.
Conclusions. The automated system was shown to be fast, reliable, and repeatable, making it suitable for processing large numbers of images. Performance was similar to that of trained manual observers.
Microaneurysms (MAs) are one of the earliest lesions visible in diabetic retinopathy. The total number of MAs has been shown to be an indicator of likely future progression to more severe retinopathy, making MAs a cardinal feature of early diabetic retinopathy.1–4
The MA population is known to be dynamic: new MAs are continually forming while existing ones disappear. The overall number of MAs often remains approximately constant, and therefore does not reflect the extensive changes taking place in the retina.5
To date, only a limited number of studies have investigated the turnover in the MA population. There are considerable practical problems associated with the measurement, not least the time required to analyze even modest numbers of images manually. Computer assistance has been used in the past to aid in the measurement,6,7 but these studies still required manual identification of every MA separately.
The accuracy of manual measurements is limited by the unavoidable variation in observer performance. Although counting the total number of MAs in an image has been shown to be reasonably reproducible, matching individual MAs in serial images is much less so, because an error in either image results in an erroneously classified MA. Human variation derives from two sources: unconscious shifts in the observer’s criteria for identifying an MA, and fatigue. A fully automated system is not affected by either of these problems.
We have described previously a fully automated computer system for detecting and quantifying MAs in fluorescein angiograms and red-free images, with performance similar to that of trained clinicians.8–10 It was found that automatic alignment of the follow-up image with the baseline image allows MA turnover to be calculated without manual intervention. The system was shown to be fast, reliable, and repeatable, making it suitable for processing large numbers of images.
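For illustration only, the turnover bookkeeping that image alignment makes possible can be sketched as follows. The minimal Python example below (the nearest-neighbour matching rule and the pixel tolerance are assumptions chosen for illustration, not the published algorithm) pairs MA detections in the registered baseline and follow-up images and classifies each MA as static, new, or regressed.

```python
from math import hypot

# Illustrative sketch only: classify MAs as static, new, or regressed by
# matching detections in two *registered* images. The distance tolerance
# (in pixels) is an assumed parameter, not a value taken from the study.
def classify_turnover(baseline, followup, tolerance=5.0):
    """baseline, followup: lists of (x, y) MA centre coordinates
    expressed in a common coordinate frame (i.e. after registration)."""
    unmatched_followup = list(followup)
    static, regressed = [], []

    for b in baseline:
        # Find the nearest unmatched follow-up MA within the tolerance.
        best, best_dist = None, tolerance
        for f in unmatched_followup:
            d = hypot(b[0] - f[0], b[1] - f[1])
            if d <= best_dist:
                best, best_dist = f, d
        if best is not None:
            static.append((b, best))          # present in both images
            unmatched_followup.remove(best)
        else:
            regressed.append(b)               # present at baseline only

    new = unmatched_followup                  # present at follow-up only
    return static, new, regressed


# Example with made-up coordinates:
baseline = [(10, 12), (40, 55), (80, 80)]
followup = [(11, 13), (120, 30)]
static, new, regressed = classify_turnover(baseline, followup)
print(len(static), len(new), len(regressed))  # 1 1 2
```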
In this study, two clinical datasets were used to evaluate the automated turnover measurement system. The first consisted of 10 patients in whom two fluorescein angiograms were acquired a year apart. These data were analyzed both manually and using the automated system, to investigate the inter- and intraobserver variation associated with manual measurement and to assess the performance of the automated system. The second dataset contained fluorescein angiograms from 25 different patients. This larger dataset was analyzed using the automated system alone, to investigate further some properties of microaneurysm turnover, in particular the differing sensitivities for the detection of new, static, and regressed microaneurysms.
First Clinical Dataset.
Second Clinical Dataset.
An ophthalmologist who was experienced in identifying and counting microaneurysms in angiographic images and also in grading retinal images using standard grading techniques was chosen as the standard observer. The same software used by the other manual observers was used to annotate MA positions for the reference standard.
The standard observer performed the analysis on two occasions 6 months apart. A total of 270 MAs were identified in the set of 20 images during the first session. Six months later, 292 MAs were marked, of which 246 matched MAs found during the first session. Seventy MAs were marked in only one of the sessions (24 in the first session, and 46 in the second).
The reproducibility (r) is defined as the ratio of the number of MAs matched in both sessions to the total number of unique MAs found across both sessions, expressed as a percentage:
\[ r = \frac{N_{\mathrm{c}}}{N_{\mathrm{c}} + N_{1} + N_{2}} \times 100\% \]
where N_c is the number of MAs common to both sessions, N_1 is the number of MAs unique to session 1, and N_2 is the number of MAs unique to session 2. From this calculation, the reproducibility of the standard observer was 78%, which is equal to the best result published to date for a turnover study.7,14,15
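As a check, substituting the counts from the standard observer's two sessions (N_c = 246, N_1 = 24, N_2 = 46) gives
\[ r = \frac{246}{246 + 24 + 46} \times 100\% = \frac{246}{316} \times 100\% \approx 77.8\% \]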
The 246 MAs that were common to both sessions were used as the reference standard, because these represented the lesions considered most likely to be true MAs.
To evaluate basic MA detection with the MA detector (at both operating points), the 10 pairs of images were treated as 20 independent images and the detections were compared with the reference standard. Turnover results were then calculated from the 10 image pairs (again at both detector operating points), and the numbers of static, new, and regressed MAs were compared with the reference standard.
The second dataset of 25 patients was analyzed using the automated system alone. The average number of MAs per image was measured, and turnover analysis was performed to determine the numbers of static, new, and regressed MAs.
Each type of MA (static, new, and regressed) may be misclassified by an error in one or both of the images. If the probability of an error is independent of MA type, then changes in the overall MA detection sensitivity should affect the three types equally.
To test the null hypothesis that there is no difference in the detectability of the three MA types, the automated detector was run twice on each dataset, the second time using a setting with greater specificity but lower sensitivity. The numbers of static, new, and regressed MAs that were detected at both sensitivity settings were recorded.
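For illustration, a comparison of the proportions detected at both settings for two MA types can be tested with a chi-square test on a 2 × 2 contingency table. The counts in the sketch below are hypothetical placeholders (the study reports only the resulting percentages and P values).

```python
# Sketch only: chi-square comparison of the proportion of MAs of two types
# that were detected at both sensitivity settings. The counts below are
# hypothetical placeholders, not values reported in the study.
from scipy.stats import chi2_contingency

# Rows: MA type; columns: detected at both settings, detected at only one.
table = [
    [54, 7],    # e.g., static MAs (hypothetical counts)
    [30, 16],   # e.g., regressed MAs (hypothetical counts)
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, P = {p:.3f}")
```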
The mean number of MAs in the reference standard was 12.3 ± 8.5 (1–27) per image. In the baseline images, the average was 10.7 ± 8.7 (1–27), compared with 13.9 ± 8.9 (3–33) in the follow-up images. The number of MAs increased in 7 patients and decreased in only 2 (remaining constant in 1), though for this small sample the difference is not statistically significant (Wilcoxon test; P ≤ 0.16). However, although the mean number of MAs apparently indicates a reasonably static MA population, turnover analysis found only 61 of the MAs to be truly static: there were 78 new MAs and 46 MAs that had regressed. Hence, only 57% of the baseline MAs were found in the follow-up images, a large change in the MA population that is not reflected by the mean number of MAs.
Interobserver variation was determined by comparing nine observers with the reference standard. The results are shown in the FROC plots in Figures 2, 3, 4, and 5.
Figure 2 shows the result for basic MA detection. The curve was fitted to the results of the manual observers, including only the first result from observers who completed the assessment three times and excluding the standard observer (because this result was used to create the reference standard). Despite all the observers having previous experience grading retinopathy, there was considerable variation in both the sensitivities and false-positive rates between observers.
Figures 3, 4, and 5 show FROC plots for turnover measurement of static, new, and regressed MAs, respectively. As for basic MA detection, there was a wide variation in both the sensitivity and the false-positive rate between observers. All the observers performed particularly poorly in detecting regressed MAs.
Intraobserver variation was determined by three observers who performed the manual measurements on three separate occasions.
Table 1 shows the results from the three observers. For each observer, the results of the second and third sessions were compared with those from the first session. The total number of MAs marked in each session is listed, together with the percentage of the marked MAs that were also marked in the first session, the percentage of the MAs marked in the first session that were detected in the subsequent session, and the reproducibility value as defined earlier. The more conservative the observer (i.e., the fewer total MAs marked), the more reproducible the result.
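In the notation of the reproducibility formula (taking session 1 as the first session and session 2 as the later session being compared), these two percentage columns correspond to
\[ \mathrm{Matching} = \frac{N_{\mathrm{c}}}{N_{\mathrm{c}} + N_{2}} \times 100\%, \qquad \mathrm{Detected} = \frac{N_{\mathrm{c}}}{N_{\mathrm{c}} + N_{1}} \times 100\% \]
For the standard observer's two sessions reported above (N_c = 246, N_1 = 24, N_2 = 46), these give 246/292 = 84.2% and 246/270 = 91.1%.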
To evaluate basic MA detection using the automated MA detector the 10 pairs of images in the first patient group were first treated as 20 independent images.
Figure 2 shows the FROC graph comparing the results from the manual observers and the automated system (both operating points) with the reference standard.
The MA turnover results for the 10 pairs of images from the first patient group, comparing the manual observers and the automated system with the reference standard, are shown as FROC graphs: static MAs in Figure 3, new MAs in Figure 4, and regressed MAs in Figure 5. As before, the graphs include only the first results from observers who completed the assessment three times, and the results of the standard observer are not shown.
The automated detector was applied to the second clinical dataset. Overall, the mean number of MAs per image was 49.1 ± 50.0 (1–245). In the baseline images the average was 41.4 ± 49.5 (1–245), compared with 56.9 ± 50.4 (6–204) in the follow-up images. As for the first dataset, the difference failed to achieve significance at the 5% level (Wilcoxon test; P ≤ 0.08). Neither was there a significant difference between the mean number of MAs at baseline and at follow-up after categorizing the patients by their retinopathy grades at baseline: the mean number of MAs per image in the mild group was 30.1 ± 21.4 (1–74) at baseline and 48.8 ± 43.1 (6–187) at follow-up, and in the moderate group it was 61.6 ± 76.0 (3–245) at baseline and 71.2 ± 61.4 (10–204) at follow-up. The number of MAs increased in 16 patients, decreased in 8, and remained constant in 1. Once again, the small change in the absolute number of MAs did not reflect the high level of turnover: per patient, a mean of 22.1 ± 32.1 (1–163) MAs were static, compared with 34.8 ± 35.6 (2–171) new MAs and 19.3 ± 20.3 (0–82) regressed MAs.
The automated system was applied to the data twice: once using a setting with higher sensitivity and once using a setting with higher specificity. In the first dataset, 88% of static MAs, 89% of new MAs, and 66% of regressed MAs were detected at both settings. The difference between the proportions of static MAs and new MAs was not significant (χ² test; P ≤ 0.8). However, the difference between the proportion of regressed MAs and that of either the static or the new MAs was significant (χ² test; P < 0.01).
In the second dataset, 93% of static MAs, 87% of new MAs, and 80% of regressed MAs were detected at both settings. The differences between the proportions of the three MA types were all significant (χ² test; P < 0.01).
The number of microaneurysms present has been shown to reflect the severity of diabetic retinopathy and to predict future disease progression. Furthermore, it may provide a simple and objective measure that will be useful for monitoring treatment response or diabetes control. Measuring the turnover of MAs is more difficult than determining the absolute number of MAs; however, it provides data about the dynamic nature of the disease that are not conveyed by the absolute count. It is not yet known whether the turnover data, for instance the number of newly formed MAs, would provide a better indicator of disease progression, but intuitively it is likely to do so.
Simple counting of MAs in an image has been shown to be reasonably robust: errors of omission tend, fortuitously, to be balanced by errors of inclusion.1–3,18 In contrast, both of these errors confound turnover measurement. Furthermore, under nonideal conditions, factors such as fatigue and distraction also increase the number of errors. Observer errors generate “turnover noise”: artifactual turnover caused by false-negative and false-positive MA identifications. Turnover noise arises from two related sources: straightforward errors in detecting MAs (described by the sensitivity and false-positive rate) and shifts in the operator’s decision criteria, leading to variations in sensitivity and specificity between the baseline and follow-up sessions. Such shifts in the decision criteria are inevitable with human observers19; for example, slightly smaller lesions may be accepted in the second session, thereby including MAs that were excluded during the first.
Table 1 shows the large variation among the three observers in the study, each of whom analyzed the same images on three separate occasions.
The absolute level of turnover noise is difficult to measure. There is no independent, reliable, and noninvasive method for counting MAs currently available to act as a gold standard. Therefore, it was necessary to designate the observer with the greatest experience of grading retinopathy images as the reference standard. The other observers, both manual and automated, were compared with this standard.
The level of turnover noise due to human variation may be estimated, without an independent gold standard, by comparing repeat measurements on the same images, where the true result (i.e., zero turnover) is known. The most consistent observer in this study (the standard observer) had a reproducibility of 78%. This is equal to the best reproducibility achieved in any turnover study published to date (previous results range from 40% to 78%).7,14,15 Even so, 22% (70/316) of the MAs marked in the two sessions were indicated inconsistently. This underestimates the total turnover error because it fails to consider consistent errors (i.e., false-positive and false-negative lesions in both sessions). Nevertheless, this statistic is useful because it represents a lower bound on the uncertainty associated with manual turnover measurements. However, as will be described, the actual uncertainty depends on the relative proportions of static, new, and regressed MAs present.
These results are probably approaching the best performance possible by manual measurement. Turnover noise is greater when different observers grade the baseline and follow-up images, because of interobserver variation. For the most consistent results, the baseline and follow-up images should be annotated at the same time and by the same observer, to ensure that similar selection criteria are applied to both images. Unfortunately, this soon becomes impractical as the number of follow-up sessions increases.
The turnover noise associated with the automated method cannot be estimated in the same way: its reproducibility is 100%, since the computer will always return the same result for the same pair of images. From the FROC graphs in Figures 2, 3, 4, and 5, the performance of the automated detector was similar to that of the manual observers. Although the automated method was apparently not as sensitive as some of the manual observers (though it was more specific than most), for robust turnover measurement the consistency of the automated system may be more important than its slightly poorer sensitivity.
Turnover analysis of the first clinical dataset (shown in the FROC graphs in Figures 3, 4, and 5) revealed an asymmetry in the detection sensitivities for static, new, and regressed MAs. Regressed MAs in particular appeared more difficult to detect reliably than static and new MAs. This finding could have been spurious: for instance, if the original images were biased in some way (e.g., if the first-session images were, on average, of poorer quality), if all the observers were biased (e.g., by the order of presentation of the images), or if the reference standard alone was biased. The latter two possibilities were ruled out by the repeated automated measurement using two different sensitivity settings, which demonstrated an asymmetry similar to that found by the manual observers, with regressed MAs again appearing more difficult to detect reliably. Finally, the likelihood that the original images in the first dataset were intrinsically biased was greatly reduced by demonstrating the same effect in the second, larger patient dataset.
The different sensitivities appear to be a genuine phenomenon: static MAs are intrinsically easier to detect than new and regressed MAs. This is probably because the appearance and disappearance of MAs are not instantaneous; instead, they pass at different rates through an intermediate stage in which their identification is equivocal. The effect appears most pronounced during regression. Consequently, the uncertainty associated with the turnover results is greatest for regressed MAs and least for static MAs.
Diabetic retinopathy is a condition that progresses relatively slowly. Trials such as the Diabetes Control and Complications Trial (DCCT)20 and the UK Prospective Diabetes Study (UKPDS)21 therefore had to recruit large numbers of patients and observe them for many years to enable trial end points to be reached, typical end points being new vessel formation or macular thickening. However, diabetic retinopathy is predominantly a disease of capillary occlusion, and MA turnover is likely to produce more sensitive measures of these changes than relying on later complications. In contrast to measures of absolute counts, which have been shown in the current study to hide the continual process of capillary occlusion and remodeling, turnover measures provide information about the dynamic nature of the disease at the capillary level. Given a larger dataset, it will be interesting to determine whether the rates of turnover of new, static, and regressed MAs correlate with current retinopathy grading and whether they provide a useful prognostic indicator of disease development.
In summary, an automated system for quantifying MA turnover was developed and compared with manual measurements. The automated system was fast and was shown to be reliable, making it suitable for processing studies containing large numbers of images. The system also worked with red-free images. Although fewer MAs were visible on red-free images, the noninvasive nature of the procedure is attractive, and work has been undertaken to investigate whether red-free turnover correlates with the turnover on angiograms. The automated system may have value in a screening context, in treatment evaluation, or for research on the dynamic nature and behavior of the MA population.
Submitted for publication September 17, 2002; revised March 31 and August 4, 2003; accepted August 11, 2003.
Disclosure: K.A. Goatman, None; M.J. Cree, None; J.A. Olson, None; J.V. Forrester, None; P.F. Sharp, None
The publication costs of this article were defrayed in part by page charge payment. This article must therefore be marked “advertisement” in accordance with 18 U.S.C. §1734 solely to indicate this fact.
Corresponding author: Keith A. Goatman, Department of Bio-Medical Physics and Bio-Engineering, University of Aberdeen, Foresterhill, Aberdeen, AB25 2ZD, UK; [email protected].
Table 1. Intraobserver Reproducibility Comparison
Observer | Session | Total MAs (n) | Matching (%) | Detected (%) | Reproducibility (%)
A | 1 | 467 | — | — | —
A | 2 | 1124 | 39.9 | 95.9 | 39.2
A | 3 | 963 | 46.2 | 95.3 | 45.1
B | 1 | 270 | — | — | —
B | 2 | 292 | 84.2 | 91.1 | 77.8
B | 3 | 292 | 80.1 | 86.7 | 71.1
C | 1 | 313 | — | — | —
C | 2 | 318 | 83.3 | 84.7 | 72.4
C | 3 | 340 | 83.5 | 90.7 | 77.0
The authors thank Fiona Strachan for help with the second dataset and Dina Christopoulou, Parwez Hossain, Chea Lim, and Alasdair Purdie for kindly acting as observers.
References
1. Klein R, Meuer SM, Moss SE, Klein BEK. The relationship of retinal microaneurysm counts to the 4-year progression of diabetic retinopathy. Arch Ophthalmol. 1989;107:1780–1785.
2. Kohner EM, Sleightholm M, KROC Collaborative Study Group. Does microaneurysm count reflect severity of early diabetic retinopathy? Ophthalmology. 1986;93:586–589.
3. Klein R, Meuer SM, Moss SE, Klein BEK. Retinal microaneurysm counts and 10-year progression of diabetic retinopathy. Arch Ophthalmol. 1995;113:1386–1391.
4. Kohner EM, Stratton IM, Aldington SJ, Turner RC, Matthews DR. Microaneurysms in the development of diabetic retinopathy (UKPDS 42). Diabetologia. 1999;42:1107–1112.
5. Kohner EM, Dollery CT. The rate of formation and disappearance of microaneurysms in diabetic retinopathy. Eur J Clin Invest. 1970;1:167–171.
6. Hellstedt T, Palsi VP, Immonen I. A computerised system for localisation of diabetic lesions from fundus images. Acta Ophthalmol. 1994;72:352–356.
7. Hellstedt T, Immonen I. Disappearance and formation rates of microaneurysms in early diabetic retinopathy. Br J Ophthalmol. 1996;80:135–139.
8. Spencer T, Phillips RP, Sharp PF, Forrester JV. Automated detection and quantification of microaneurysms in fluorescein angiograms. Graefes Arch Clin Exp Ophthalmol. 1992;230:36–41.
9. Cree MJ, Olson JA, McHardy KC, Sharp PF, Forrester JV. A fully automated comparative microaneurysm digital detection system. Eye. 1997;11:622–628.
10. Hipwell JH, Strachan F, Olson JA, McHardy KC, Sharp PF, Forrester JV. Automated detection of microaneurysms in digital red-free photographs: a diabetic retinopathy screening tool. Diabet Med. 2000;17:588–594.
11. Klein BE, Davis MD, Segal P, et al. Diabetic retinopathy: assessment of severity and progression. Ophthalmology. 1984;91:10–17.
12. Aldington SJ, Kohner EM, Meuer S, Klein R, Sjølie AK. Methodology for retinal photography and assessment of diabetic retinopathy: the EURODIAB IDDM complications study. Diabetologia. 1995;38:437–444.
13. Cideciyan AV, Jacobson SG, Kemp CM, Knighton RW, Nagel JH. Registration of high resolution images of the retina. Proc SPIE. 1992;1652:310–322.
14. Baudoin C, Maneschi F, Quentel G, et al. Quantitative evaluation of fluorescein angiograms: microaneurysm counts. Diabetes. 1983;32:8–13.
15. Jalli PYI, Hellstedt TJ, Immonen IJR. Early versus late staining of microaneurysms in fluorescein angiography. Retina. 1997;17:211–215.
16. Metz CE. ROC methodology in radiologic imaging. Invest Radiol. 1986;21:720–733.
17. Chakraborty DP. Maximum likelihood analysis of free-response receiver operator characteristic (FROC) data. Med Phys. 1989;16:561–565.
18. DAMAD Study Group. Effect of aspirin alone and aspirin plus dipyridamole in early diabetic retinopathy: a multicentre randomised controlled clinical trial. Diabetes. 1989;38:491–498.
19. Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Med Phys. 1996;23:1709–1725.
20. Shamoon H, Duffy H, Fleischer N, et al. Implementation of treatment protocols in the diabetes control and complications trial. Diabetes Care. 1995;18:361–376.
21. Turner RC, Holman RR, Matthews D, et al. UK prospective diabetes study (UKPDS): study design, progress and performance. Diabetologia. 1991;34:877–890.