Abstract
Purpose:
Recent studies on diabetic retinopathy (DR) screening in fundus photographs suggest that disagreements between algorithms and clinicians are now comparable to disagreements among clinicians. The purpose of this study is to (1) determine whether this observation also holds for automated DR severity assessment algorithms, and (2) demonstrate the value of such algorithms in clinical practice.
Methods:
A dataset of 85 consecutive DR examinations (168 eyes, 1176 multimodal eye fundus photographs) was collected at Brest University Hospital (Brest, France). Two clinicians with different experience levels determined DR severity in each eye, according to the International Clinical Diabetic Retinopathy Disease Severity (ICDRS) scale. Based on Cohen's kappa (κ) measurements, the performance of clinicians at assessing DR severity was compared to the performance of state-of-the-art content-based image retrieval (CBIR) algorithms from our group.
Results:
When assessing DR severity at the patient level, intraobserver agreement was κ = 0.769 for the most experienced clinician. Interobserver agreement between clinicians was κ = 0.526. Interobserver agreement between the most experienced clinician and the most advanced algorithm was κ = 0.592. Moreover, the most advanced algorithm was often able to predict agreements and disagreements between clinicians.
Conclusions:
Automated DR severity assessment algorithms, trained to imitate experienced clinicians, can be used to predict when young clinicians would agree or disagree with their more experienced peers. Such algorithms may thus be used in clinical practice to help validate or invalidate their diagnoses. CBIR algorithms, in particular, may also be used for pooling diagnostic knowledge among peers, with applications in training and in the coordination of clinicians' prescriptions.
Diabetic retinopathy (DR) is the leading cause of blindness in the working-age population of the United States and the European Union.1,2 Detecting and monitoring DR in the at-risk population (diabetic patients), generally using eye fundus photography, is crucial for providing timely treatment of DR and therefore preventing visual loss.3 With the at-risk population expected to increase,4 many computer image analysis algorithms have been proposed in the literature to analyze fundus photographs automatically.5,6 Recently, several image-analysis groups compared the performance of their image analysis algorithms with that of clinicians in detecting DR in large-scale screening programs.7-10 Disagreements between algorithms and clinicians were found to be comparable to disagreements among clinicians.9
Screening for DR mainly involves detecting microaneurysms, usually the first signs of DR to appear, although detecting hemorrhages and exudates has also proven useful.11 In comparison, grading (i.e., monitoring) DR is much more complicated: according to the International Clinical Diabetic Retinopathy Disease Severity (ICDRS) scale, it involves detecting additional types of lesions (e.g., neovascularizations and intraretinal microvascular abnormalities) with a larger variability of scales and shapes.12 As a consequence, little work has been done so far to automate DR grading, compared with DR screening. However, several algorithms of increasing complexity have recently been proposed by our group for automated DR grading using fundus photography,13,14 together with demographic data in the most advanced algorithms.15,16 These algorithms are all based on the content-based image retrieval (CBIR) paradigm,17,18 which is explained hereafter. In CBIR, the content of images is characterized by low-level features (e.g., color, texture, and local shape), and these features are mapped to concepts (e.g., DR severity) using machine learning; in particular, it is not necessary to develop a dedicated algorithm for each lesion type. Once a query image has been automatically characterized, the most similar images, together with their medical interpretations, are searched for in a reference database. These most similar images are then used to infer an automated diagnosis for the query image. Note that a CBIR approach relying on dedicated algorithms and limited manual inputs has also been proposed by another image analysis group.19
The purpose of this study was twofold: first, to determine whether, as in the DR screening context, disagreements between automated DR grading algorithms and clinicians are comparable to disagreements among clinicians. For this purpose, a dataset of 85 consecutive DR examinations (1176 images) was graded, at eye and patient level, by two clinicians with different experience levels. The dataset was also graded automatically by three (one novel and two recently published) algorithms from our group. The first two algorithms analyze images independently to make a decision; the second algorithm, which is novel, allows additional flexibility in the way images are characterized. The third algorithm uses both image characterizations and demographic data to make a decision; note that both image data and demographic data are part of the DR monitoring protocol. Second, the possible uses of automated DR grading algorithms in a clinical context were examined. The place of automated DR screening algorithms in clinical practice has been widely discussed7,9 (these algorithms may be used for triage, since physicians are unable to screen the entire at-risk population), but that of automated DR grading algorithms has not. The specific advantages of CBIR in this context are emphasized.
Two clinicians were involved in this study: one with 7 years' experience (Clinician1) and the other with 2 years' experience (Clinician2). Each clinician was asked to grade disease severity in all 168 eyes according to a modified ICDRS scale. This scale consists of the five ICDRS levels (0, no apparent DR; 1, mild nonproliferative DR; 2, moderate nonproliferative DR; 3, severe nonproliferative DR; and 4, proliferative DR),12 as well as an additional level (5, treated DR). The clinicians interpreted the 168 eyes in randomized order. When interpreting one eye, the clinicians had access to all seven photographs, as well as the available demographic data, but they were masked to all photographs and interpretations from the contralateral eye.
Two months later, Clinician1 interpreted the dataset a second time. Therefore, three interpretations are available at eye level: Clinician1 contributed EyeGrades1a and EyeGrades1b; Clinician2 contributed EyeGrades2.
Based on each of these three interpretations, disease severity was also determined (automatically) at patient level: When a patient contributed two eyes to the study, disease severity at patient level was defined as the maximum disease severity at eye level among those two eyes. Therefore, three interpretations are also available at patient level: PatientGrades1a, PatientGrades1b, and PatientGrades2.
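This patient-level rule is straightforward; a minimal Python sketch (the data layout is hypothetical) is:

```python
# Patient-level severity: the maximum eye-level severity among the
# patient's graded eyes (minimal sketch; data layout is hypothetical).
def patient_grade(eye_grades):
    """eye_grades: list of ICDRS levels (0-5) for the patient's one or two eyes."""
    return max(eye_grades)

# Example: right eye graded 2 (moderate NPDR), left eye graded 1 (mild NPDR).
assert patient_grade([2, 1]) == 2
```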
The proposed algorithms rely on the machine learning paradigm, so examples are necessary for training. For this purpose, DRD was divided into two subsets (A and B) with equal distributions of sex, diabetes type, and DR severity. All eyes from the same patient were assigned to the same subset. Apart from these constraints, DRD was divided randomly between the two subsets.
EyeGrades1a and PatientGrades1a were used as the reference standard for algorithm supervision at eye level and patient level, respectively. Performance was assessed by two-fold cross-validation. First, subset A was used for training (i.e., tuning the algorithms), and subset B was used for testing (i.e., comparing the outputs of the algorithms with the reference standard). Then, subset B was used for training, and subset A was used for testing.
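As an illustration, the patient-level split can be sketched as follows; the per-patient data layout and field names are hypothetical, and the stratification shown is a simple approximation of the balanced split described above:

```python
import random
from collections import defaultdict

def split_patients(patients, seed=0):
    """Split patients into subsets A and B with balanced sex, diabetes type,
    and DR severity. `patients` maps patient_id -> (sex, diabetes_type,
    severity); the layout is hypothetical. All eyes of a patient follow
    that patient into the same subset."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for pid, profile in patients.items():
        strata[profile].append(pid)        # group patients with identical profiles
    subset_a, subset_b = set(), set()
    for pids in strata.values():
        rng.shuffle(pids)                  # randomize within each stratum
        half = len(pids) // 2
        subset_a.update(pids[:half])
        subset_b.update(pids[half:])
    return subset_a, subset_b

# Two-fold cross-validation: train on A / test on B, then train on B / test on A.
```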
To automatically grade disease severity in a query eye (or patient) Q, the following procedure was applied:
1. The digital content of each image I_q in Q and of each image I_ts in the training subset was automatically characterized by a feature vector.
2. The distance between the characterization of I_q and that of each image I_ts in the training subset was computed.
3. The k nearest neighbors of I_q within the training subset were sought, with respect to the distance measure in step 2.
4. The most frequent diagnosis among the nearest neighbors of every image I_q in Q (according to the reference standard) was assigned to Q.
The first three steps are the usual steps of a CBIR system.17 Should a clinician using the system disagree with the proposed automated diagnosis (step 4), the nearest neighbors can be displayed and used by the clinician to revise the diagnosis.
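Assuming the feature vectors of step 1 are given, steps 2 through 4 reduce to a k-nearest-neighbor search with majority voting. A minimal sketch follows (function and variable names are hypothetical, and a plain Euclidean distance stands in for the weighted distance D of the Appendix):

```python
import numpy as np
from collections import Counter

def grade_query(query_features, train_features, train_grades, k):
    """Steps 2-4 for one query eye (or patient) Q.
    query_features: (n_query_images, n_features) array, one row per image of Q.
    train_features: (n_train_images, n_features) array.
    train_grades:   reference-standard severity for each training image."""
    neighbor_grades = []
    for fq in query_features:
        # Step 2: distance between I_q and every training image.
        dists = np.linalg.norm(train_features - fq, axis=1)
        # Step 3: the k nearest neighbors of I_q.
        nearest = np.argsort(dists)[:k]
        neighbor_grades.extend(train_grades[i] for i in nearest)
    # Step 4: most frequent diagnosis among the neighbors of all images in Q.
    return Counter(neighbor_grades).most_common(1)[0][0]
```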
The use of the wavelet transform20 has been proposed in previous works13,14 to characterize the digital content of images, and the superiority of this methodology over several alternatives has been shown.13,14 This approach was improved further in the present study, in that a novel set of wavelet filters was introduced. Two wavelet adaptation algorithms of increasing complexity, referred to as the Global and Local algorithms, are presented in the Appendix.
A third algorithm, referred to as the Fusion algorithm, was evaluated: steps 1 and 2 were based on local wavelet adaptation, and steps 3 and 4 were improved, as explained hereafter. A recently published information fusion algorithm from our group, based on Bayesian networks and the Dezert-Smarandache theory,15 was used to combine the characterizations of all images from a query eye (or patient), as well as demographic data, to find the k most similar eyes (or patients) in the training subset. The most frequent diagnosis among these k nearest neighbors (with respect to the reference standard) was used as the automated diagnosis for the query eye (or patient).
A fourth algorithm, referred to as the NoAngiography algorithm, was evaluated: it is similar to the Fusion algorithm, except that it is masked to all angiographs. Should this algorithm perform equally well without angiographs, we might recommend that fluorescein injection be omitted (or performed less often), which would make the imaging session both shorter and less invasive.
Each algorithm (Global, Local, Fusion, and NoAngiography) was tuned to maximize Cohen's κ between its outputs and the reference standard in the training subset (see Training and Testing Subsets). In particular, k, the number of nearest neighbors (see Automated DR Severity Assessment Using CBIR), was trained by leave-one-out cross-validation in the training subset. Agreement between each algorithm and the clinicians was then assessed, in the testing subset, using the optimal value for k.
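The selection of k can be sketched as follows, reusing the hypothetical grade_query helper above in a leave-one-out loop over the training subset; cohen_kappa_score from scikit-learn stands in for the paper's κ computation:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def tune_k(train_sets, train_grades, candidate_ks):
    """Pick the k that maximizes Cohen's kappa under leave-one-out
    cross-validation. train_sets: one image-feature array per eye;
    train_grades: the corresponding reference-standard grades."""
    best_k, best_kappa = None, -1.0
    for k in candidate_ks:
        predictions = []
        for i, features in enumerate(train_sets):
            # Leave eye i out; pool the remaining eyes' image features and grades.
            rest_feats = np.vstack([f for j, f in enumerate(train_sets) if j != i])
            rest_grades = np.concatenate(
                [np.full(len(f), train_grades[j])
                 for j, f in enumerate(train_sets) if j != i])
            predictions.append(grade_query(features, rest_feats, rest_grades, k))
        kappa = cohen_kappa_score(train_grades, predictions)
        if kappa > best_kappa:
            best_k, best_kappa = k, kappa
    return best_k
```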
In this article, the diabetic retinopathy (DR) grading performance of CBIR algorithms from our group was compared to the performance of two clinicians with different experience levels.
First, interclinician agreement (κ = 0.493 at eye level; κ = 0.526 at patient level) was much lower than intraclinician agreement (κ = 0.809 at eye level; κ = 0.769 at patient level), at least for clinicians with different experience levels. Note, however, that the diagnoses of Clinician2 (the less experienced clinician) seldom differed from those of Clinician1 (the more experienced clinician) by more than one severity level (Tables 1, 2); wider divergences were observed more often between the algorithms and Clinician1 (Table 3).
Second, the simplest CBIR algorithms (Global and Local), which combine image characterizations in a basic way, were less efficient than Clinician2 in terms of Cohen's κ and weighted κ (Tables 4, 5). On the other hand, the performance of the most advanced algorithm (Fusion), which combines image characterizations and demographic data, compared favorably to that of Clinician2 (κ = 0.573 at eye level; κ = 0.592 at patient level).
Third, we found that masking the Fusion algorithm to all angiographs noticeably decreased diagnostic performance (κ = 0.466 at eye level; κ = 0.457 at patient level). This performance decrease may be due to the higher discrimination power of angiography over other image modes. However, it may also be because nasal and temporal fields were photographed only after fluorescein injection. Further analyses are therefore needed to draw conclusions about the usefulness of angiography for automated DR severity assessment.
Fourth, the potential usefulness of CBIR algorithms as a second opinion, to assist the least experienced clinicians, has been shown (Tables 6, 7). In particular, whenever Clinician2 disagreed with the algorithm at patient level, there was a 73.81% probability that he also disagreed with Clinician1, as opposed to 38.82% without prior knowledge (P = 0.0002). This result could serve as a warning that he should revise his diagnosis. Similarly, whenever Clinician2 agreed with the algorithm, there was a 95.35% probability that he also agreed with Clinician1, as opposed to 61.18% without prior knowledge (P < 0.0001). This should increase his confidence.
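These conditional probabilities can be reproduced from a 2 × 2 agreement table. The counts below are back-calculated from the reported percentages (43 patients where Clinician2 agreed with the algorithm, 42 where he disagreed; 41/43 = 95.35%, 31/42 = 73.81%, 33/85 = 38.82%), not taken from Tables 6 and 7; Fisher's exact test is used as one reasonable significance test, since this section does not restate which test produced the reported P values:

```python
from scipy.stats import fisher_exact

# Rows: Clinician2 vs. algorithm (agree / disagree).
# Columns: Clinician2 vs. Clinician1 (agree / disagree).
# Counts back-calculated from the reported percentages (hypothetical reconstruction).
table = [[41, 2],    # agreed with the algorithm
         [11, 31]]   # disagreed with the algorithm

p_disagree = table[1][1] / sum(table[1])      # 31/42 = 0.7381
p_agree = table[0][0] / sum(table[0])         # 41/43 = 0.9535
baseline = (table[0][1] + table[1][1]) / 85   # 33/85 = 0.3882 (no prior knowledge)

odds_ratio, p_value = fisher_exact(table)
print(p_disagree, p_agree, baseline, p_value)
```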
Note that each of the above observations was made both at eye level and at patient level.
We believe the fourth observation is of great practical value. We propose that an algorithm be used in the context of DR severity assessment, as a second opinion, to help validate or invalidate the diagnoses of young clinicians. Because such algorithms are able to provide a diagnosis in seconds,15 they may be embedded in clinicians' workstations and display the proposed diagnosis on a screen. One advantage of the CBIR approach over traditional computer-assisted diagnosis (CADx) is its interactivity: should the clinician disagree with the proposed second opinion, he or she may visualize (also from the workstation) the k nearest neighbors from the reference dataset that were used to infer the second opinion, together with their medical interpretations from more experienced clinicians. This feature would help the clinician (1) to see whether the algorithm obviously made an error and, if not, (2) to compare his or her interpretation with that of his or her peers on similar cases.
More generally, the proposed CBIR-based approach may be used to pool diagnostic knowledge among peers (either hospitalwide or nationwide), which has several possible applications. First, it may be used for training: interns may now be able to compare their interpretations of real-life cases with those of renowned experts. Second, it may help in reducing interclinician variability and therefore help clinicians to coordinate their clinical decisions and prescriptions in a DR-grading program (e.g., to determine which patients should undergo a particular treatment), as is already done in screening programs.7,8
In conclusion, this preliminary study paves the way for the use of CBIR algorithms in clinical practice as a second opinion, to help validate or invalidate the diagnoses of young clinicians.
Let I be an input image of size M × N, and let w be a wavelet filter of size (2K + 1) × (2L + 1). Filter w is used to extract information from I at a given analysis scale, in a given direction. The convolution of I with translated (and dilated) versions of w leads to the following set of coefficients (referred to as a subband):

$$x_{s;K,L}(i,j) = \sum_{k=-K}^{K} \sum_{l=-L}^{L} w(k,l)\, I(i - sk,\; j - sl),$$

where s is the analysis scale. By varying the filter's aspect ratio (K/L), we can obtain subbands associated with different directions (horizontal, vertical, and nondirectional).
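A minimal NumPy/SciPy sketch of this subband computation is given below; implementing the analysis scale s by dilating the filter (inserting s − 1 zeros between its coefficients, in the spirit of the à trous algorithm) is an assumption of this sketch:

```python
import numpy as np
from scipy.ndimage import convolve

def subband(image, w, s):
    """Compute x_{s;K,L}(i,j) for every pixel (i,j) of `image`.
    w: (2K+1) x (2L+1) wavelet filter; s: analysis scale, applied here
    by dilating the filter (an assumption of this sketch)."""
    rows, cols = w.shape
    dilated = np.zeros(((rows - 1) * s + 1, (cols - 1) * s + 1))
    dilated[::s, ::s] = w                  # insert s-1 zeros between coefficients
    return convolve(image, dilated, mode='reflect')
```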
The coefficients of filter w were tuned (as described in §a or §b) to increase the performance of DR severity assessment in the training subset.
To characterize the digital content of image I, the distribution of the x_{s;K,L}(i,j) coefficients is modeled in several subbands. Because the x_{s;K,L}(i,j) coefficients in each subband follow a zero-mean generalized Gaussian distribution (for small values of s),13,23 their distribution can be efficiently modeled by their standard deviation σ_{s;K,L}(I) and kurtosis κ_{s;K,L}(I):

$$\sigma_{s;K,L}(I) = \sqrt{m_{s;K,L,2}(I)}, \qquad \kappa_{s;K,L}(I) = \frac{m_{s;K,L,4}(I)}{m_{s;K,L,2}(I)^{2}},$$

where m_{s;K,L,d}(I) is the dth-order moment of the distribution.
The proposed image characterization is a feature vector consisting of the [σ_{s;K,L}(I), κ_{s;K,L}(I)] couples extracted in several subbands. To characterize the lowest frequencies (corresponding to s → ∞), an intensity histogram of I was also included in the proposed image characterization.
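Building on the hypothetical subband helper above, the feature extraction can be sketched as follows (the set of filters, scales, and the 32-bin histogram are illustrative choices):

```python
import numpy as np

def characterize(image, filters, scales):
    """Build the feature vector: a (sigma, kappa) couple per subband, plus an
    intensity histogram for the lowest frequencies. `filters` maps (K, L)
    aspect ratios to wavelet filters (illustrative layout)."""
    features = []
    for s in scales:
        for w in filters.values():
            x = subband(image, w, s).ravel()
            m2 = np.mean(x ** 2)           # 2nd-order moment (zero mean assumed)
            m4 = np.mean(x ** 4)           # 4th-order moment
            features.extend([np.sqrt(m2), m4 / m2 ** 2])     # sigma, kappa
    hist, _ = np.histogram(image, bins=32, density=True)     # lowest frequencies
    return np.asarray(features), hist
```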
Distance D(I, J) between the characterization of image I and that of image J was defined as follows13:

$$D(I,J) = D_H(I,J) + \sum_{s;K,L} \left[\, \alpha_{s;K,L} \left| \sigma_{s;K,L}(I) - \sigma_{s;K,L}(J) \right| + \beta_{s;K,L} \left| \kappa_{s;K,L}(I) - \kappa_{s;K,L}(J) \right| \,\right],$$

where D_H(I, J) denotes the Euclidean distance between the intensity histograms of I and J, and α_{s;K,L} and β_{s;K,L} are subband weights.
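A direct transcription of this distance, using the output of the hypothetical characterize helper above (the separation of subband features and histogram is an implementation choice of this sketch):

```python
import numpy as np

def distance(feats_i, hist_i, feats_j, hist_j, alpha, beta):
    """D(I, J): histogram distance plus weighted subband-feature differences.
    feats_*: flat arrays of alternating (sigma, kappa) values per subband;
    alpha, beta: one weight per subband."""
    d_hist = np.linalg.norm(hist_i - hist_j)     # D_H(I, J), Euclidean
    sigma_i, kappa_i = feats_i[0::2], feats_i[1::2]
    sigma_j, kappa_j = feats_j[0::2], feats_j[1::2]
    return (d_hist
            + np.sum(alpha * np.abs(sigma_i - sigma_j))
            + np.sum(beta * np.abs(kappa_i - kappa_j)))
```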
a. Global Adaptation of Image Characterizations and Distance Measures
b. Local Adaptation of Image Characterizations and Distance Measures
c. Training the Mapping Functions for Local Filter and Weight Adaptation
d. Derivatives of the Proposed Image Characterizations with Respect to Wavelet Filter Coefficients