**Purpose.**
Recent studies on diabetic retinopathy (DR) screening in fundus photographs suggest that disagreements between algorithms and clinicians are now comparable to disagreements among clinicians. The purpose of this study was to (1) determine whether this observation also holds for automated DR severity assessment algorithms and (2) demonstrate the value of such algorithms in clinical practice.

**Methods.**
A dataset of 85 consecutive DR examinations (168 eyes, 1176 multimodal eye fundus photographs) was collected at Brest University Hospital (Brest, France). Two clinicians with different experience levels determined DR severity in each eye, according to the International Clinical Diabetic Retinopathy Disease Severity (ICDRS) scale. Based on Cohen's kappa (κ) measurements, the performance of clinicians in assessing DR severity was compared with that of state-of-the-art content-based image retrieval (CBIR) algorithms from our group.

**Results.**
In patient-level DR severity assessment, intraobserver agreement was κ = 0.769 for the most experienced clinician. Interobserver agreement between clinicians was κ = 0.526, and agreement between the most experienced clinician and the most advanced algorithm was κ = 0.592. Moreover, the most advanced algorithm was often able to predict agreements and disagreements between clinicians.

**Conclusions.**
Automated DR severity assessment algorithms, trained to imitate experienced clinicians, can be used to predict when young clinicians would agree or disagree with their more experienced colleagues. Such algorithms may thus be used in clinical practice to help validate or invalidate these clinicians' diagnoses. CBIR algorithms, in particular, may also be used to pool diagnostic knowledge among peers, with applications in training and in the coordination of clinicians' prescriptions.

Detecting and monitoring DR in the at-risk population (diabetics), generally using eye fundus photography, is crucial for providing timely treatment of DR and therefore preventing visual loss.^{3} With the likelihood of an increase in the at-risk population,^{4} many computer image analysis algorithms have been proposed in the literature to analyze fundus photographs automatically.^{5,6} Recently, several image-analysis groups compared the performance of their image analysis algorithms with the performance of clinicians in detecting DR in large-scale screening programs.^{7–10} Disagreements between algorithms and clinicians were found to be comparable to disagreements among clinicians.^{9,11}

In comparison, grading (i.e., monitoring) DR is much more complicated: according to the International Clinical Diabetic Retinopathy Disease Severity (ICDRS) scale, it involves detecting additional types of lesions (e.g., neovascularizations and intraretinal microvascular abnormalities) with a larger variability of scales and shapes.^{12} As a consequence, little work has been done so far to automate DR grading, in comparison with DR screening. However, several algorithms of increasing complexity have recently been proposed by our group for automated DR grading using fundus photography,^{13,14} together with demographic data in the most advanced algorithms.^{15,16} These algorithms are all based on the content-based image retrieval (CBIR) paradigm, which is explained hereafter.^{17,18}

In CBIR, the content of images is characterized with low-level features (e.g., color, texture, and local shape), and these features are mapped to concepts (e.g., DR severity) using machine learning; in particular, it is not necessary to develop a dedicated algorithm for each lesion type. Once a query image has been automatically characterized, the most similar images, together with their medical interpretations, are searched for in a reference database. These most similar images are then used to infer an automated diagnosis for the query image. Note that a CBIR approach, relying on dedicated algorithms and limited manual inputs, has also been proposed by another image analysis group.^{19}

The potential benefits of automated DR screening have been discussed previously^{7,9}—these algorithms may be used for triage, because physicians are unable to screen the entire at-risk population—but the benefits of automated DR grading have not. The specific advantages of CBIR in this context are emphasized.

Two clinicians interpreted the photographs: one more experienced (*Clinician1*) and the other with 2 years' (*Clinician2*) experience. Each clinician was asked to grade disease severity in all 168 eyes according to a modified ICDRS scale. This scale consists of the five ICDRS levels (0, no apparent DR; 1, mild nonproliferative DR; 2, moderate nonproliferative DR; 3, severe nonproliferative DR; and 4, proliferative DR),^{12} as well as an additional level (5, treated DR). The clinicians interpreted the 168 eyes in randomized order. When interpreting one eye, the clinicians had access to all seven photographs, as well as the available demographic data, but they were masked to all photographs and interpretations from the contralateral eye.

To assess intraobserver variability, *Clinician1* interpreted the dataset a second time. Therefore, three interpretations are available at eye level: *Clinician1* contributed *EyeGrades1a* and *EyeGrades1b*; *Clinician2* contributed *EyeGrades2*. Three interpretations are similarly available at patient level: *PatientGrades1a*, *PatientGrades1b*, and *PatientGrades2*.

*EyeGrades1a* and *PatientGrades1a* were used as the *reference standard* for algorithm supervision at eye level and patient level, respectively. Performance was assessed by two-fold cross-validation: at first, subset A was used for training (i.e., tuning the algorithms) and subset B for testing (i.e., comparing the outputs of the algorithms with the *reference standard*); then, subset B was used for training and subset A for testing.

To grade each query eye or patient *Q*, the following procedure was applied:

1. The digital content of each image *I*_{q} in *Q* and of each image *I*_{ts} in the training subset was automatically characterized by a feature vector.
2. The distance between the characterization of *I*_{q} and that of each image *I*_{ts} in the training subset was computed.
3. The *k* nearest neighbors of *I*_{q} within the training subset were sought with respect to the distance measure in step 2.
4. The most frequent diagnosis among the nearest neighbors of every image *I*_{q} in *Q* (according to the *reference standard*) was assigned to *Q*.

Should a clinician using the system disagree with the proposed automated diagnosis (step 4), the nearest neighbors can be displayed and used by the clinician to revise the diagnosis.^{17}
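As a concrete sketch, the four-step procedure can be written as follows. The feature extraction is abstracted behind a `characterize` placeholder (in the paper it is wavelet based, as described in the Appendix), and all names and the Euclidean distance are illustrative rather than taken from the original implementation.

```python
from collections import Counter

import numpy as np

def characterize(image):
    # Step 1 placeholder: any fixed-length feature vector
    # (the study uses wavelet-based characterizations).
    return np.asarray(image, dtype=float).ravel()

def grade_query(query_images, training_set, k=3):
    """Assign to a query eye/patient Q the most frequent reference-standard
    diagnosis among the k nearest neighbors of each of its images."""
    votes = []
    for img in query_images:                       # step 1: characterize
        q = characterize(img)
        # step 2: distance to every training image
        dists = [(np.linalg.norm(q - characterize(i)), grade)
                 for i, grade in training_set]
        dists.sort(key=lambda t: t[0])             # step 3: k nearest neighbors
        votes.extend(grade for _, grade in dists[:k])
    # step 4: most frequent diagnosis among all collected neighbors
    return Counter(votes).most_common(1)[0][0]
```

Each entry of `training_set` pairs an image with its reference-standard grade; the majority vote in step 4 pools the neighbors of every image in the query examination.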

A methodology based on the wavelet transform^{20} has been proposed in previous works^{13,14} to characterize the digital content of images, and the superiority of this methodology over several alternatives has been shown.^{13,14} This approach was improved further in the present study, in that a novel set of wavelet filters was introduced. Two wavelet adaptation algorithms of increasing complexity, referred to as the *Global* and *Local* algorithms, are presented in the Appendix.

A third algorithm, referred to as the *Fusion* algorithm, was evaluated: steps 1 and 2 were based on local wavelet adaptation, and steps 3 and 4 were improved, as explained hereafter. A recently published information fusion algorithm from our group, based on Bayesian networks and the Dezert-Smarandache theory,^{15} was used to combine the characterizations of all images from a query eye (or patient), as well as demographic data, to find the *k* most similar eyes (or patients) in the training subset. The most frequent diagnosis among these *k* nearest neighbors (with respect to the *reference standard*) was used as the automated diagnosis for the query eye (or patient).

A fourth algorithm, the *NoAngiography* algorithm, was evaluated: it is similar to the *Fusion* algorithm, except that it is masked to all angiographs. Should this algorithm perform equally well without angiographs, we might recommend that no (or less) fluorescein be injected, which would make the imaging session more streamlined and less invasive.

Agreement between graders (clinicians or algorithms) was measured with Cohen's κ^{21} and weighted κ (κ_{w}).^{22} The equation for κ is

κ = (*p*_{a} − *p*_{ca}) / (1 − *p*_{ca}),  (1)

where *p*_{a} is the observed probability of agreement and *p*_{ca} is the probability of chance agreement. The simplest and most standard weighting scheme was used for κ_{w}: a disagreement of *n*_{sl} severity levels was weighted by *n*_{sl}.
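For concreteness, both statistics can be computed directly from a confusion matrix; this is a minimal sketch (the study's values were presumably obtained with standard statistical software), using the linear weights described above for κ_{w}.

```python
import numpy as np

def cohen_kappa(confusion, weighted=False):
    """Cohen's kappa from a square confusion matrix.

    With weighted=True, a disagreement of n_sl severity levels is
    weighted by n_sl (linear weights), as in the study."""
    c = np.asarray(confusion, dtype=float)
    n = c.sum()
    rows, cols = c.sum(axis=1), c.sum(axis=0)
    levels = np.arange(c.shape[0])
    # disagreement weights: 0 on the diagonal, |i - j| (or 1) off-diagonal
    w = (np.abs(levels[:, None] - levels[None, :]) if weighted
         else (levels[:, None] != levels[None, :]).astype(float))
    observed = (w * c).sum() / n                        # observed disagreement
    expected = (w * np.outer(rows, cols)).sum() / n**2  # chance disagreement
    return 1.0 - observed / expected
```

With the 0/1 weights, `1 - observed/expected` reduces to (*p*_{a} − *p*_{ca}) / (1 − *p*_{ca}), i.e., equation 1.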

Each algorithm (*Local*, *Global*, *Fusion*, and *NoAngiography*) was tuned to maximize Cohen's κ between its outputs and the *reference standard* in the training subset (see Training and Testing Subsets). In particular, *k*, the number of nearest neighbors (see Automated DR Severity Assessment Using CBIR), was trained by leave-one-out cross-validation in the training subset. Agreement between each algorithm and the clinicians was then assessed, in the testing subset, using the optimal value for *k*.
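The leave-one-out selection of *k* can be sketched as follows. Names are illustrative, and plain accuracy stands in for Cohen's κ to keep the example short (the study maximized κ).

```python
import numpy as np

def knn_predict(features, grades, query, k):
    """Majority vote among the k nearest training samples."""
    d = np.linalg.norm(features - query, axis=1)
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique(grades[nearest], return_counts=True)
    return vals[np.argmax(counts)]

def select_k(features, grades, candidates=(1, 3, 5, 7, 9)):
    """Leave-one-out cross-validation: each training sample is graded
    from all the others; keep the k that agrees best with the reference
    standard (here: accuracy; the study used Cohen's kappa)."""
    features, grades = np.asarray(features, float), np.asarray(grades)
    best_k, best_score = None, -1.0
    for k in candidates:
        hits = 0
        for i in range(len(grades)):
            keep = np.arange(len(grades)) != i   # leave sample i out
            pred = knn_predict(features[keep], grades[keep], features[i], k)
            hits += (pred == grades[i])
        score = hits / len(grades)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

The same loop, run once per training subset, yields the per-subset optima reported in the Results (e.g., *k* = 5 on one subset and *k* = 7 on the other).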

The algorithms, trained with the *reference standard* provided by the most experienced clinician (*Clinician1*), were used to automatically predict when the least experienced clinician (*Clinician2*) disagrees with his more experienced colleague. We compared the probability of agreement (or disagreement) between *Clinician1* and *Clinician2*, knowing that *Clinician2* agrees (or disagrees) with a given algorithm, to the overall probability of agreement (or disagreement) between *Clinician1* and *Clinician2*. A two-tailed test for difference between proportions was performed.
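The two-tailed test for a difference between proportions is, under the usual large-sample assumptions, a two-proportion z-test with a pooled variance estimate; a minimal sketch:

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed z-test for H0: p1 == p2, given x successes in n trials."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                    # pooled proportion under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-tailed P value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Here one proportion is the conditional agreement (or disagreement) rate given the algorithm's output, and the other is the overall rate.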

**Table 1.** DR severity at eye level: confusion matrices of *EyeGrades1b* (top) and *EyeGrades2* (bottom) versus *EyeGrades1a* (columns, severity levels 0–5).

| | 0 | 1 | 2 | 3 | 4 | 5 | Total |
|---|---|---|---|---|---|---|---|
| ***EyeGrades1b*** | | | | | | | |
| 0 | 23 | 4 | 0 | 0 | 1 | 0 | 28 |
| 1 | 0 | 36 | 11 | 0 | 0 | 0 | 47 |
| 2 | 0 | 1 | 23 | 3 | 2 | 1 | 30 |
| 3 | 0 | 0 | 1 | 34 | 0 | 0 | 35 |
| 4 | 0 | 0 | 0 | 1 | 10 | 0 | 11 |
| 5 | 0 | 0 | 0 | 0 | 1 | 16 | 17 |
| Total | 23 | 41 | 35 | 38 | 14 | 17 | 168 |
| ***EyeGrades2*** | | | | | | | |
| 0 | 15 | 0 | 0 | 0 | 0 | 0 | 15 |
| 1 | 5 | 19 | 2 | 0 | 0 | 0 | 26 |
| 2 | 2 | 20 | 22 | 4 | 1 | 0 | 49 |
| 3 | 1 | 1 | 8 | 18 | 1 | 2 | 31 |
| 4 | 0 | 1 | 0 | 9 | 10 | 1 | 21 |
| 5 | 0 | 0 | 3 | 7 | 2 | 14 | 26 |
| Total | 23 | 41 | 35 | 38 | 14 | 17 | 168 |

**Table 2.** DR severity at patient level: confusion matrices of *PatientGrades1b* (top) and *PatientGrades2* (bottom) versus *PatientGrades1a* (columns, severity levels 0–5).

| | 0 | 1 | 2 | 3 | 4 | 5 | Total |
|---|---|---|---|---|---|---|---|
| ***PatientGrades1b*** | | | | | | | |
| 0 | 9 | 2 | 0 | 0 | 0 | 0 | 11 |
| 1 | 0 | 15 | 7 | 0 | 0 | 0 | 22 |
| 2 | 0 | 1 | 11 | 2 | 2 | 1 | 17 |
| 3 | 0 | 0 | 0 | 18 | 0 | 0 | 18 |
| 4 | 0 | 0 | 0 | 1 | 7 | 0 | 8 |
| 5 | 0 | 0 | 0 | 0 | 0 | 9 | 9 |
| Total | 9 | 18 | 18 | 21 | 9 | 10 | 85 |
| ***PatientGrades2*** | | | | | | | |
| 0 | 6 | 0 | 0 | 0 | 0 | 0 | 6 |
| 1 | 3 | 10 | 1 | 0 | 0 | 0 | 14 |
| 2 | 0 | 8 | 13 | 3 | 1 | 0 | 25 |
| 3 | 0 | 0 | 3 | 10 | 1 | 1 | 15 |
| 4 | 0 | 0 | 0 | 4 | 6 | 2 | 12 |
| 5 | 0 | 0 | 1 | 4 | 1 | 7 | 13 |
| Total | 9 | 18 | 18 | 21 | 9 | 10 | 85 |

Confusion matrices of the most advanced algorithm (*Fusion*) versus the *reference standard*, in terms of DR severity at eye and patient levels, are given in Table 3. The optimal value for *k* was *k* = 5 on subset A and *k* = 7 on subset B. Agreement between all algorithms and clinicians, at eye and patient levels, is reported in Tables 4 and 5, respectively.

**Table 3.** Confusion matrices of the *Fusion* algorithm versus the *reference standard* (columns, severity levels 0–5), at eye level (top) and patient level (bottom).

| | 0 | 1 | 2 | 3 | 4 | 5 | Total |
|---|---|---|---|---|---|---|---|
| ***Fusion* at eye level** | | | | | | | |
| 0 | 12 | 2 | 0 | 0 | 0 | 0 | 14 |
| 1 | 6 | 27 | 2 | 1 | 0 | 0 | 36 |
| 2 | 4 | 4 | 25 | 7 | 2 | 1 | 43 |
| 3 | 1 | 7 | 5 | 25 | 2 | 1 | 41 |
| 4 | 0 | 1 | 3 | 3 | 7 | 1 | 15 |
| 5 | 0 | 0 | 0 | 2 | 3 | 14 | 19 |
| Total | 23 | 41 | 35 | 38 | 14 | 17 | 168 |
| ***Fusion* at patient level** | | | | | | | |
| 0 | 6 | 1 | 0 | 0 | 0 | 0 | 7 |
| 1 | 2 | 13 | 3 | 1 | 0 | 0 | 19 |
| 2 | 1 | 0 | 13 | 5 | 1 | 1 | 21 |
| 3 | 0 | 4 | 2 | 13 | 2 | 1 | 22 |
| 4 | 0 | 0 | 0 | 1 | 5 | 1 | 7 |
| 5 | 0 | 0 | 0 | 1 | 1 | 7 | 9 |
| Total | 9 | 18 | 18 | 21 | 9 | 10 | 85 |

**Table 4.** Agreement at eye level, measured by Cohen's κ (top) and weighted κ (bottom).

| | *EyeGrades1a* | *EyeGrades1b* | *EyeGrades2* | *Global* | *Local* | *Fusion* | *NoAngiography* |
|---|---|---|---|---|---|---|---|
| **Cohen's κ** | | | | | | | |
| *EyeGrades1a* | 1 | 0.809 | 0.493 | 0.387 | 0.456 | 0.573 | 0.466 |
| *EyeGrades1b* | | 1 | 0.410 | 0.346 | 0.422 | 0.509 | 0.446 |
| *EyeGrades2* | | | 1 | 0.411 | 0.380 | 0.391 | 0.340 |
| *Global* | | | | 1 | 0.683 | 0.516 | 0.658 |
| *Local* | | | | | 1 | 0.689 | 0.730 |
| *Fusion* | | | | | | 1 | 0.738 |
| *NoAngiography* | | | | | | | 1 |
| **Weighted κ** | | | | | | | |
| *EyeGrades1a* | 1 | 0.884 | 0.678 | 0.538 | 0.626 | 0.696 | 0.614 |
| *EyeGrades1b* | | 1 | 0.636 | 0.507 | 0.591 | 0.656 | 0.604 |
| *EyeGrades2* | | | 1 | 0.595 | 0.598 | 0.586 | 0.536 |
| *Global* | | | | 1 | 0.784 | 0.660 | 0.755 |
| *Local* | | | | | 1 | 0.774 | 0.786 |
| *Fusion* | | | | | | 1 | 0.826 |
| *NoAngiography* | | | | | | | 1 |

**Table 5.** Agreement at patient level, measured by Cohen's κ (top) and weighted κ (bottom).

| | *PatientGrades1a* | *PatientGrades1b* | *PatientGrades2* | *Global* | *Local* | *Fusion* | *NoAngiography* |
|---|---|---|---|---|---|---|---|
| **Cohen's κ** | | | | | | | |
| *PatientGrades1a* | 1 | 0.769 | 0.526 | 0.384 | 0.450 | 0.592 | 0.457 |
| *PatientGrades1b* | | 1 | 0.485 | 0.375 | 0.410 | 0.578 | 0.431 |
| *PatientGrades2* | | | 1 | 0.397 | 0.382 | 0.391 | 0.436 |
| *Global* | | | | 1 | 0.778 | 0.556 | 0.757 |
| *Local* | | | | | 1 | 0.723 | 0.734 |
| *Fusion* | | | | | | 1 | 0.673 |
| *NoAngiography* | | | | | | | 1 |
| **Weighted κ** | | | | | | | |
| *PatientGrades1a* | 1 | 0.861 | 0.714 | 0.549 | 0.633 | 0.717 | 0.628 |
| *PatientGrades1b* | | 1 | 0.692 | 0.520 | 0.600 | 0.720 | 0.607 |
| *PatientGrades2* | | | 1 | 0.583 | 0.623 | 0.597 | 0.591 |
| *Global* | | | | 1 | 0.851 | 0.677 | 0.794 |
| *Local* | | | | | 1 | 0.789 | 0.819 |
| *Fusion* | | | | | | 1 | 0.792 |
| *NoAngiography* | | | | | | | 1 |

**Table 6.** Probability of agreement (top) and disagreement (bottom) between *EyeGrades2* and *EyeGrades1a*, with and without knowledge of the agreement between *EyeGrades2* and each algorithm.

| Algorithm: test of agreement | P(*EyeGrades2* = *EyeGrades1a*) | P(*EyeGrades2* = *EyeGrades1a* \| *EyeGrades2* = Algorithm) | *P* |
|---|---|---|---|
| *Global* | 58.33% | 74.71% | 0.0098 |
| *Local* | 58.33% | 80.72% | 0.0004 |
| *Fusion* | 58.33% | 90.59% | <0.0001 |
| *NoAngiography* | 58.33% | 84.62% | <0.0001 |

| Algorithm: test of disagreement | P(*EyeGrades2* ≠ *EyeGrades1a*) | P(*EyeGrades2* ≠ *EyeGrades1a* \| *EyeGrades2* ≠ Algorithm) | *P* |
|---|---|---|---|
| *Global* | 41.67% | 59.26% | 0.0092 |
| *Local* | 41.67% | 63.53% | 0.0010 |
| *Fusion* | 41.67% | 74.70% | <0.0001 |
| *NoAngiography* | 41.67% | 64.44% | 0.0005 |

**Table 7.** Probability of agreement (top) and disagreement (bottom) between *PatientGrades2* and *PatientGrades1a*, with and without knowledge of the agreement between *PatientGrades2* and each algorithm.

| Algorithm: test of agreement | P(*PatientGrades2* = *PatientGrades1a*) | P(*PatientGrades2* = *PatientGrades1a* \| *PatientGrades2* = Algorithm) | *P* |
|---|---|---|---|
| *Global* | 61.18% | 79.07% | 0.0417 |
| *Local* | 61.18% | 88.10% | 0.0018 |
| *Fusion* | 61.18% | 95.35% | <0.0001 |
| *NoAngiography* | 61.18% | 82.61% | 0.0116 |

| Algorithm: test of disagreement | P(*PatientGrades2* ≠ *PatientGrades1a*) | P(*PatientGrades2* ≠ *PatientGrades1a* \| *PatientGrades2* ≠ Algorithm) | *P* |
|---|---|---|---|
| *Global* | 38.82% | 57.14% | 0.0508 |
| *Local* | 38.82% | 65.12% | 0.0049 |
| *Fusion* | 38.82% | 73.81% | 0.0002 |
| *NoAngiography* | 38.82% | 64.10% | 0.0088 |

Severity assessments by *Clinician2* (the least experienced clinician) seldom differed from those of *Clinician1* (the most experienced clinician) by more than one severity level (Tables 1, 2); wider divergences were observed more often between the algorithms and *Clinician1* (Table 3).

The simplest algorithms (*Global* and *Local*), which combine image characterizations in a basic way, were less efficient than *Clinician2* in terms of Cohen's κ and weighted κ (Tables 4, 5). On the other hand, the performance of the most advanced algorithm (*Fusion*), which elegantly combines image characterizations and demographic data, compared favorably to the performance of *Clinician2* (κ = 0.573 at eye level; κ = 0.592 at patient level).

Masking the *Fusion* algorithm to all angiographs noticeably decreased diagnostic performance (κ = 0.466 at eye level; κ = 0.457 at patient level). This decrease may be due to the higher discrimination power of angiography over the other imaging modes. However, it may also be because the nasal and temporal fields were photographed only after fluorescein injection. Further analyses are therefore needed to draw conclusions about the usefulness of angiography for automated DR severity assessment.

Whenever *Clinician2* disagreed with the most advanced algorithm (*Fusion*) at patient level, there was a 73.81% probability that he also disagreed with *Clinician1*, as opposed to 38.82% without prior knowledge (*P* = 0.0002). This result could serve as a warning that he should revise his diagnosis. Similarly, whenever *Clinician2* agreed with the algorithm, there was a 95.35% probability that he also agreed with *Clinician1*, as opposed to 61.18% without prior knowledge (*P* < 0.0001). This should increase his confidence.

Such algorithms^{15} may be embedded in clinicians' workstations to display the proposed diagnosis on a screen. One advantage of the CBIR approach over traditional computer-assisted diagnosis (CADx) is its interactivity: should the clinician disagree with the proposed second opinion, he or she may visualize (also from a workstation) the *k* nearest neighbors from the reference dataset that were used to infer the second opinion, together with their medical interpretations from more experienced clinicians. This feature would help the clinician (1) to see whether the algorithm obviously made an error and, if not, (2) to compare his or her interpretation with that of his or her peers on similar cases.^{7,8}

Let *I* be an input image of size *M* × *N*, and let *w* be a wavelet filter of size (2*K* + 1) × (2*L* + 1). Filter *w* is used to extract information from *I* at a given analysis scale, in a given direction. The convolution of *I* with translated versions of *w* leads to the following set of coefficients (referred to as a subband):

*x*_{s;K,L}(*i*, *j*) = Σ_{k = −K…K} Σ_{l = −L…L} *w*_{k,l} · *I*(*i* − *s*·*k*, *j* − *s*·*l*),  (2)

where *s* is the analysis scale. By varying the filter's aspect ratio (*K*/*L*), we can obtain subbands associated with different directions (horizontal, vertical, and nondirectional).
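Assuming a dilated (à trous-style) convolution with periodic boundary handling (both are assumptions of this sketch; the original implementation details are not given here), one subband can be computed as:

```python
import numpy as np

def subband(image, w, s):
    """One wavelet subband: convolution of the image with the 2-D filter w,
    dilated by the analysis scale s. Wrap-around boundary handling is an
    assumption, not taken from the original paper."""
    I = np.asarray(image, dtype=float)
    K, L = (w.shape[0] - 1) // 2, (w.shape[1] - 1) // 2
    out = np.zeros_like(I)
    for k in range(-K, K + 1):
        for l in range(-L, L + 1):
            # np.roll(I, (s*k, s*l))[i, j] == I(i - s*k, j - s*l) (periodic)
            out += w[k + K, l + L] * np.roll(I, (s * k, s * l), axis=(0, 1))
    return out
```

Varying the filter shape `(2K+1, 2L+1)` yields the horizontal, vertical, and nondirectional subbands mentioned above.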

The coefficients of *w* were tuned (as described in §a or §b) to increase the performance of DR severity assessment in the training subset.

To characterize an image *I*, the distribution of the *x*_{s;K,L}(*i*, *j*) coefficients is modeled in several subbands. Because the *x*_{s;K,L}(*i*, *j*) coefficients in each subband have a 0-mean generalized Gaussian distribution (for small values of *s*),^{13,23} their distribution can be efficiently modeled by their standard deviation σ_{s;K,L}(*I*) and kurtosis κ_{s;K,L}(*I*):

σ_{s;K,L}(*I*) = [*m*_{s;K,L,2}(*I*)]^{1/2},  κ_{s;K,L}(*I*) = *m*_{s;K,L,4}(*I*) / [*m*_{s;K,L,2}(*I*)]^{2},  (3)

where *m*_{s;K,L,d}(*I*) = (1/(*MN*)) Σ_{i,j} [*x*_{s;K,L}(*i*, *j*)]^{d} is the *d*th-order moment of the distribution.

The characterization of *I* therefore consists of the [σ_{s;K,L}(*I*), κ_{s;K,L}(*I*)] couples extracted in several subbands. To characterize the lowest frequencies (corresponding to *s* → ∞), an intensity histogram of *I* was also included in the proposed image characterization.
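A sketch of the per-subband features follows; because the coefficients are zero-mean, raw moments coincide with central moments, so the standard deviation and kurtosis of equation 3 reduce to simple moment ratios.

```python
import numpy as np

def subband_features(x):
    """Standard deviation and kurtosis of the (zero-mean) subband
    coefficients, from the 2nd- and 4th-order moments (equation 3)."""
    x = np.asarray(x, dtype=float).ravel()
    m2 = np.mean(x ** 2)          # 2nd-order moment
    m4 = np.mean(x ** 4)          # 4th-order moment
    sigma = np.sqrt(m2)           # standard deviation
    kurt = m4 / m2 ** 2           # kurtosis
    return sigma, kurt
```

Applying this to every subband, and appending an intensity histogram for the lowest frequencies, yields the full image characterization described above.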

Three filters were used per analysis scale (a vertical filter *V*, a horizontal filter *H*, and a nondirectional filter *ND*), each defined by its size couple (*K*, *L*). Small filter sizes proved sufficient in previous works,^{13,14} and so small couples were used, with the filter coefficients constrained as described previously^{20}; consequently, there were four undetermined coefficients in the first two filters and eight in the last one. As in previous work, three analysis scales were used in this study: *s* = 1, *s* = 2, and *s* = 3; overall, there were 18 undetermined weights in equation 4.

The filter coefficients and weights were tuned to maximize agreement with the *reference standard* in the training subset. A genetic algorithm was used to find the optimal set of coefficients.^{13,24}

Because DR severity is defined by complex rules,^{12} it is likely that the boundaries between severity levels in the space of image characterizations are complex. Therefore, in this second scenario, the wavelet filters and distance measures were allowed to vary *continuously* in image characterization space. A continuous function was used to map *d*_{0}(*I*), the initial characterization of an image *I*, obtained by the initial set of filters {*V*_{0}, *H*_{0}, *ND*_{0}} (see §a), to a new set of filters {*V*(*I*), *H*(*I*), *ND*(*I*)} and a new set of weights *W*(*I*). *Continuity* means that two images with similar initial characterizations are mapped to similar sets of filters and weights; the following constraint holds for *V*(*I*) [similar constraints hold for *H*(*I*), *ND*(*I*), and *W*(*I*)]:

‖*V*(*I*_{1}) − *V*(*I*_{2})‖ ≤ λ · ‖*d*_{0}(*I*_{1}) − *d*_{0}(*I*_{2})‖,  (5)

for any two images *I*_{1} and *I*_{2}.

For each image *I*_{ts} in the training subset, {*V*(*I*_{ts}), *H*(*I*_{ts}), *ND*(*I*_{ts})} and *W*(*I*_{ts}) were tuned to maximize severity assessment performance for *I*_{ts}, while respecting the continuity constraint (see §c). The continuity constraint prevented the algorithm from overfitting the training data. Moreover, it was necessary to define {*V*(*I*_{q}), *H*(*I*_{q}), *ND*(*I*_{q})} and *W*(*I*_{q}), where *I*_{q} is an image in the testing subset. These coefficients were obtained by multivariate interpolation^{25}:

*V*(*I*_{q}) = Σ_{ts} α_{ts}(*I*_{q}) · *V*(*I*_{ts}),  (6)

where the interpolation weights α_{ts}(*I*_{q}) are nonnegative, sum to one, and decrease as ‖*d*_{0}(*I*_{q}) − *d*_{0}(*I*_{ts})‖ increases. Similar equations hold for *H*(*I*_{q}), *ND*(*I*_{q}), and *W*(*I*_{q}). These new filters and weights were used to find the *k* most similar neighbors of *I*_{q}.
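For illustration, such an interpolation can be sketched as a kernel-weighted average; the Gaussian-type kernel and the use of λ as its bandwidth are assumptions of this sketch, not necessarily the scheme of reference 25.

```python
import numpy as np

def interpolate_coeffs(d0_query, d0_train, coeffs_train, lam=1.0):
    """Kernel-weighted average of per-training-image filter/weight
    coefficients, so that nearby initial characterizations d0 yield
    nearby coefficients (the continuity requirement). The exponential
    kernel is an illustrative choice, not from the original paper."""
    d0_train = np.asarray(d0_train, dtype=float)
    coeffs_train = np.asarray(coeffs_train, dtype=float)
    dist = np.linalg.norm(d0_train - np.asarray(d0_query, dtype=float), axis=1)
    weights = np.exp(-dist / lam)        # nonnegative, decreasing with distance
    return weights @ coeffs_train / weights.sum()
```

The weights are normalized to sum to one, matching the interpolation form of equation 6; a test image close to a training image in initial-characterization space inherits nearly the same filters and weights.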

Let *I*_{ts} be an image from the training subset. A gradient descent was used to find {*V*(*I*_{ts}), *H*(*I*_{ts}), *ND*(*I*_{ts}), *W*(*I*_{ts})}, using {*V*_{0}, *H*_{0}, *ND*_{0}, *W*_{0}} as the initial point.

At each step (*j*), let *J*_{ts}^{(j)} be the nearest neighbor of *I*_{ts} with a different severity level from *I*_{ts}, and let *K*_{ts}^{(j)} be the nearest neighbor of *I*_{ts} with the same severity level as *I*_{ts}, such that *D*^{(j)}(*I*_{ts}, *K*_{ts}^{(j)}) ≥ *D*^{(j)}(*I*_{ts}, *J*_{ts}^{(j)}). One way to increase severity assessment performance for *I*_{ts} is to minimize the ratio between *S*^{(j)} = *D*^{(j)}(*I*_{ts}, *K*_{ts}^{(j)}) and *O*^{(j)} = *D*^{(j)}(*I*_{ts}, *J*_{ts}^{(j)}). To respect the continuity constraint, each coefficient *x* in {*V*(*I*_{ts}), *H*(*I*_{ts}), *ND*(*I*_{ts}), *W*(*I*_{ts})} was increased by Δ*x*^{(j)}(*I*_{ts}) at each step (*j*):

Δ*x*^{(j)}(*I*_{ts}) = −λ · ∂(*S*^{(j)}/*O*^{(j)})/∂*x*,  (7)

where λ > 0, which controls the regularity of the mapping functions, was obtained by cross-validation on the training subset; the same value for λ was used in equation 6. If *x* is a weight (respectively, a wavelet coefficient), then equation 7 can be computed using equation 4 (respectively, equation 4 and §d).
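One descent step on the ratio *S*/*O* can be sketched as follows. The gradient is estimated here by finite differences for brevity (the paper uses analytic derivatives, per §d), and `same_dist`/`other_dist` are illustrative placeholders for the distances to the two reference neighbors as functions of the coefficients.

```python
import numpy as np

def descent_step(x, same_dist, other_dist, lam=0.1, eps=1e-6):
    """One gradient-descent step on the ratio S/O, where S = distance to
    the nearest same-severity neighbor and O = distance to the nearest
    other-severity neighbor, both functions of the coefficient vector x."""
    x = np.asarray(x, dtype=float)
    ratio = lambda v: same_dist(v) / other_dist(v)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        # central finite difference (the paper uses analytic derivatives)
        grad[i] = (ratio(x + e) - ratio(x - e)) / (2 * eps)
    return x - lam * grad          # Delta x = -lambda * d(S/O)/dx
```

Shrinking *S*/*O* pulls the same-severity neighbor closer than the other-severity one, which is exactly the margin violation the step (*j*) construction targets.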

Let (*k*, *l*) be the coordinates of one coefficient in *w* [*k* ∈ {−*K*, …, *K*}, *l* ∈ {−*L*, …, *L*}] with respect to which we would like to compute the derivative of *I*'s characterization (see equation 3). We first compute the derivative of the *d*th-order moment *m*_{s;K,L,d}(*I*) of the distribution (see equation 3) with respect to *w*_{k,l} (see equation 2):

∂*m*_{s;K,L,d}(*I*)/∂*w*_{k,l} = (*d*/(*MN*)) Σ_{i,j} [*x*_{s;K,L}(*i*, *j*)]^{d−1} · *I*(*i* − *s*·*k*, *j* − *s*·*l*).

The derivative of *I*'s characterization with respect to *w*_{k,l} is then given by:

∂σ_{s;K,L}(*I*)/∂*w*_{k,l} = [∂*m*_{s;K,L,2}(*I*)/∂*w*_{k,l}] / [2 σ_{s;K,L}(*I*)],

∂κ_{s;K,L}(*I*)/∂*w*_{k,l} = [∂*m*_{s;K,L,4}(*I*)/∂*w*_{k,l} − 2 κ_{s;K,L}(*I*) *m*_{s;K,L,2}(*I*) ∂*m*_{s;K,L,2}(*I*)/∂*w*_{k,l}] / [*m*_{s;K,L,2}(*I*)]^{2}.
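The moment derivative can be checked numerically; the subband convention used here (dilation by *s*, wrap-around boundaries) is an assumption of this sketch, kept consistent between the moment and its derivative.

```python
import numpy as np

def moment(I, w, s, d):
    """d-th order moment of the subband coefficients (equations 2-3);
    wrap-around boundary handling is an assumption."""
    K, L = (w.shape[0] - 1) // 2, (w.shape[1] - 1) // 2
    x = sum(w[k + K, l + L] * np.roll(I, (s * k, s * l), axis=(0, 1))
            for k in range(-K, K + 1) for l in range(-L, L + 1))
    return np.mean(x ** d)

def moment_derivative(I, w, s, d, k, l):
    """Analytic derivative of the d-th moment with respect to w[k, l]:
    d/(MN) times the sum of x^(d-1) times the translated image."""
    K, L = (w.shape[0] - 1) // 2, (w.shape[1] - 1) // 2
    x = sum(w[a + K, b + L] * np.roll(I, (s * a, s * b), axis=(0, 1))
            for a in range(-K, K + 1) for b in range(-L, L + 1))
    return d * np.mean(x ** (d - 1) * np.roll(I, (s * k, s * l), axis=(0, 1)))
```

Because the subband is linear in the filter coefficients, the analytic derivative agrees with a central finite difference of the moment, which is a convenient sanity check when implementing the gradient descent of §c.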