Abstract
Purpose: We evaluated inter-examiner reliability in grading of clinical variables associated with meibomian gland dysfunction (MGD) performed in real-time examination versus grading of a digital image.
Methods: Meibography grading of meibomian gland atrophy and acini appearance, and slit-lamp grading of lid debris and telangiectasias, were conducted on 410 post-menopausal women. Meibography and slit-lamp photos were captured digitally and saved for analysis by a masked examiner. Gland atrophy was graded as the proportion of partial glands in the lower lid, and acini appearance by the presence or absence of grape-like clusters. Lid debris was graded by severity and telangiectasias by quantity, each from the same image. Observed agreement and weighted kappas (κw) with 95% confidence intervals (CI) determined the degree of inter-examiner reliability between grading of these clinical variables in real-time examination and in digital photographs using a multiple-point categorical scale.
Results: Observed agreement was 40.6% for telangiectasias, 50.9% for lid debris, 42.8% for gland dropout, and 54.5% for acini appearance. Inter-examiner reliability for the four clinical outcomes ranged from fair agreement for acini appearance (κw = 0.23; 95% CI, 0.14–0.32) and lid debris (κw = 0.24; 0.16–0.32) to moderate agreement for gland dropout (κw = 0.50; 0.40–0.59) and telangiectasias (κw = 0.47; 0.39–0.55).
Conclusions: Grading of gland dropout, and potentially of lid telangiectasias, from a photograph is more representative of grading in a real-time examination than grading of acini appearance and lid debris. Alternative grading scales and/or clinical variables associated with MGD should be addressed in future studies.
All images were saved with a coded identification number and graded later by a masked examiner who was trained to read the archived images. For each subject, two static images of the central lower lids were captured using a photo slit-lamp (color images) and during the meibography procedure. The image examiner analyzed the first of the two images captured from the right eye; only if the first photo was of substandard quality (poor focus, dim illumination, or poor centration) was the second image opened, evaluated, and graded. The image examiner did not grade any images from the left eye; therefore, all data used in the statistical analyses were from the right eye only.
Grades from the real-time examination, and images from the slit-lamp examination and meibography, were collected from 410 subjects. The overall average age was 62.3 ± 8.8 years, with a median of 61 years (range, 46–91 years; 100% women). Of the 410 images collected from the right eye, 11 (2.7%) from the slit-lamp examination and 10 (2.4%) from meibography were deemed not gradable because of poor image quality (poor illumination or focus), equipment malfunction (image not saved, or power failure), or insufficient lid eversion to appreciate the meibomian glands during the meibography procedure.
The inter-examiner reliability data associated with the grading scales for their respective clinical outcomes are illustrated in Tables 2 through 5. The overall agreement between real-time and digital image grading was 40.6% for lid telangiectasias (Table 2), 50.9% for lid debris (Table 3), 42.8% for gland dropout (Table 4), and 54.5% for acini appearance (Table 5). The unweighted and weighted κ values (with 95% CI) for lid telangiectasias, lid debris, gland dropout, and acini appearance are presented in Table 6. For grading of gland dropout and lid telangiectasias, there was fair-to-moderate reliability between the real-time and image examinations. Reliability for lid debris and acini appearance fared worse, with only slight-to-fair reliability between the real-time and image examiners.
Table 2. Between Examiner-Reader Observations for Overall Lid Telangiectasias Grading of Severity (Cell Data Represent the Number of Observations)

| Examiner \ Masked Reader | Grade 0 | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Row Totals (%) |
|---|---|---|---|---|---|---|
| Grade 0 | 97 | 14 | 5 | 0 | 0 | 116 (29.1) |
| Grade 1 | 37 | 26 | 9 | 2 | 0 | 74 (18.5) |
| Grade 2 | 40 | 42 | 36 | 5 | 0 | 123 (30.8) |
| Grade 3 | 11 | 21 | 36 | 3 | 0 | 71 (17.8) |
| Grade 4 | 0 | 1 | 8 | 6 | 0 | 15 (3.8) |
| Column totals (%) | 185 (46.4) | 104 (26.1) | 94 (23.5) | 16 (4.0) | 0 (0.0) | 399 |
Table 3. Between Examiner-Reader Observations for Overall Lid Debris Grading of Severity (Cell Data Represent the Number of Observations)

| Examiner \ Masked Reader | Grade 0 | Grade 1 | Grade 2 | Grade 3 | Row Totals (%) |
|---|---|---|---|---|---|
| Grade 0 | 163 | 24 | 4 | 0 | 191 (47.9) |
| Grade 1 | 110 | 31 | 6 | 0 | 147 (36.8) |
| Grade 2 | 30 | 13 | 7 | 1 | 51 (12.8) |
| Grade 3 | 5 | 1 | 2 | 2 | 10 (2.5) |
| Column totals (%) | 308 (77.2) | 69 (17.3) | 19 (4.8) | 3 (0.7) | 399 |
Table 4. Between Examiner-Reader Observations for Overall Meibomian Gland Dropout Grades of Severity (Cell Data Represent the Number of Observations)

| Examiner \ Masked Reader | Grade 1 | Grade 2 | Grade 3 | Grade 4 | Row Totals (%) |
|---|---|---|---|---|---|
| Grade 1 | 15 | 47 | 12 | 5 | 79 (19.8) |
| Grade 2 | 7 | 74 | 25 | 17 | 123 (30.7) |
| Grade 3 | 1 | 49 | 32 | 33 | 115 (28.8) |
| Grade 4 | 0 | 11 | 22 | 50 | 83 (20.7) |
| Column totals (%) | 23 (5.8) | 181 (45.2) | 91 (22.8) | 105 (26.2) | 400 |
Table 5. Between Examiner-Reader Observations for Overall Meibomian Gland Acini Appearance Grading of Severity (Cell Data Represent the Number of Observations)

| Examiner \ Masked Reader | Grade 1 | Grade 2 | Grade 3 | Row Totals (%) |
|---|---|---|---|---|
| Grade 1 | 22 | 51 | 16 | 89 (22.3) |
| Grade 2 | 46 | 171 | 55 | 272 (68.0) |
| Grade 3 | 0 | 14 | 25 | 39 (9.7) |
| Column totals (%) | 68 (17.0) | 236 (59.0) | 96 (24.0) | 400 |
Table 6. Unweighted and Weighted κ Values for Evaluating Agreement of the Four Clinical Variables between Paired Grades of the Real-Time and Image Examiner

| Clinical Variable | Unweighted κ (95% CI) | Weighted κ (95% CI) |
|---|---|---|
| Lid telangiectasias | 0.19 (0.14–0.25) | 0.47 (0.39–0.55) |
| Lid debris | 0.12 (0.06–0.19) | 0.24 (0.16–0.32) |
| Gland dropout | 0.22 (0.16–0.27) | 0.50 (0.40–0.59) |
| Acini appearance | 0.15 (0.08–0.22) | 0.23 (0.14–0.32) |
The mean difference ± SD in overall grades was 0.63 ± 1.00 (median = 1) for lid telangiectasias (Fig. 1A) and 0.41 ± 0.83 (median = 0) for lid debris (Fig. 1B), and −0.19 ± 0.97 (median = 0) for gland dropout (Fig. 2A) and −0.20 ± 0.73 (median = 0) for acini appearance (Fig. 2B). (A positive mean difference indicates that the real-time examiner overall graded a clinical outcome higher than the image examiner; a negative mean difference indicates that the image examiner overall graded higher than the real-time examiner.) Data for gland dropout followed a normal distribution (P = 0.22); however, the distributions for lid telangiectasias and lid debris were left-skewed (both P < 0.0001), and the distribution for acini appearance was right-skewed (P = 0.02).
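For readers who wish to reproduce this style of summary, the sketch below (Python, with hypothetical grade vectors rather than study data) computes the paired mean difference under the same sign convention and applies a Shapiro-Wilk test as one possible normality check; the paper does not name the specific normality test it used.

```python
# Illustrative sketch (not the authors' code). The difference is computed
# as (real-time grade - image grade), so a positive value means the
# real-time examiner graded higher, matching the paper's convention.
import numpy as np
from scipy import stats

real_time = np.array([2, 1, 3, 0, 2, 4, 1, 2])  # hypothetical grades
image     = np.array([1, 1, 2, 0, 1, 3, 0, 2])  # hypothetical grades

diff = real_time - image
print(f"mean difference = {diff.mean():.2f} +/- {diff.std(ddof=1):.2f}, "
      f"median = {np.median(diff):.0f}")

# One possible normality check on the paired differences.
w, p = stats.shapiro(diff)
print(f"Shapiro-Wilk P = {p:.3f}")
```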
Diagnostic tests, photographic interpretations, and physical examination findings frequently rely on a degree of subjective interpretation by clinicians. To determine the usefulness of a clinical outcome measure, reliability and validity must be established. Overall agreement is one method used to evaluate reliability. Although it provides a measurement of agreement, it does not take into account chance agreement or disagreement between examiners. If examiners agree or disagree solely by chance, no true measurement of reliability is taking place. This issue can be addressed by calculating the κ coefficient.
The κ statistic is the most commonly used statistic for measuring agreement between ratings of two or more examiners (inter-examiner) or by the same examiner on two or more occasions (intra-examiner). The unweighted κ indicates the proportion of agreement above what is expected by chance, although it does not differentiate between disagreements due to chance and those due to an examiner's inconsistent grading pattern (systematic error or bias). Unweighted κ values treat the grades as purely categorical, so every disagreement is penalized equally, with no partial credit assigned. Weighted κ coefficients offer a benefit by penalizing disagreements according to their magnitude, so a near-miss of one grade costs less than a disagreement of several grades.
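As an illustration of this distinction, the following sketch computes observed agreement alongside unweighted and weighted κ using scikit-learn's cohen_kappa_score. The grade vectors are hypothetical stand-ins for the paired examiner/reader ratings, and linear weights are assumed here; the paper does not specify its weighting scheme.

```python
# Minimal sketch of the agreement statistics described above.
import numpy as np
from sklearn.metrics import cohen_kappa_score

examiner = np.array([0, 1, 2, 2, 3, 1, 0, 2, 4, 3])  # hypothetical grades
reader   = np.array([0, 0, 2, 1, 2, 1, 0, 2, 3, 3])  # hypothetical grades

observed = np.mean(examiner == reader)           # raw observed agreement
kappa_u  = cohen_kappa_score(examiner, reader)   # all disagreements equal
kappa_w  = cohen_kappa_score(examiner, reader,
                             weights="linear")   # near-misses penalized less

print(f"observed = {observed:.1%}, unweighted k = {kappa_u:.2f}, "
      f"weighted k = {kappa_w:.2f}")
```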
The large differences between unweighted and weighted κ values likely result from the number of paired observations that disagreed by only one grade. As seen in Tables 2 and 3, more than 37% (149/399) of lid telangiectasias and 39% (156/399) of lid debris observations from the slit-lamp examination were within one grade of each other. The real-time examiner graded one grade higher than the image examiner approximately 80% of the time (121/149 for lid telangiectasias and 125/156 for lid debris; see Figs. 1A, 1B), which most likely explains the pronounced left skew of the data distributions seen in both figures. Likewise, in Tables 4 and 5, nearly 46% (183/400) of meibomian gland dropout and 42% (166/400) of visible acini observations fell within one grade of exact agreement. The image examiner graded one grade higher than the real-time examiner just over half the time (57%, or 105/183) for gland dropout (Fig. 2A), and nearly two-thirds of the time (64%, or 106/166) for acini appearance (Fig. 2B). These findings likely explain the normal distribution of the mean grade differences for gland dropout and the right-skewed distribution (image examiner generally grading higher than the real-time examiner) for acini appearance.
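The within-one-grade tabulation described above can be read directly off a cross-tabulation. The sketch below does so for the gland dropout counts in Table 4 and reproduces the 42.8% exact agreement, the 183/400 one-grade near-misses, and the 57% share in which the image reader graded higher.

```python
# Rows = real-time examiner grades 1-4; columns = masked reader grades 1-4.
import numpy as np

table4 = np.array([[15, 47, 12,  5],
                   [ 7, 74, 25, 17],
                   [ 1, 49, 32, 33],
                   [ 0, 11, 22, 50]])

total = table4.sum()                                 # 400 gradable pairs
exact = np.trace(table4)                             # exact agreement: 171
rows, cols = np.indices(table4.shape)
off_by_one = table4[np.abs(rows - cols) == 1].sum()  # 183 pairs off by one
reader_higher = table4[cols - rows == 1].sum()       # 105 of those 183

print(f"exact = {exact/total:.1%}, within one grade = {off_by_one/total:.1%}, "
      f"image reader one grade higher = {reader_higher/off_by_one:.0%}")
```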
The high percentage of grades for each clinical variable differing by only one grade accounts for the greater weighted κ relative to the unweighted κ. Which κ to apply may depend on the scenario: the weighted κ may be more relevant to the clinical care of patients, where absolute agreement might not be necessary, whereas the unweighted κ may be more important when evaluating an outcome in a clinical trial or epidemiological study, where absolute agreement bears on the validity of the study findings.
Depending on the clinical variable evaluated, in some instances the real-time examiner assigned a higher grade than the image examiner (lid telangiectasias and debris), while in other instances the digital image examiner assigned the higher grade (gland dropout and acini visibility). The findings from the slit-lamp examination (lid debris and, to a lesser extent, lid telangiectasias) suggest either that the real-time examiner over-graded relative to the digital image examiner or that the digital image examiner underestimated the actual grade relative to the real-time examination. Conversely, for the outcomes evaluated by meibography (acini appearance and, to a lesser extent, gland dropout), the real-time examiner assigned an overall lower grade than the digital image examiner. Beyond the non-normal distribution of the mean difference data (lid telangiectasias, lid debris, and acini appearance), the real-time examinations may have benefited from the dynamic examination process, which aids accurate grading, whereas grading from a single photograph eliminates such gestalt impressions. Should photographic techniques be used in a clinical trial, multiple images may be needed from different angles or illuminations to approximate the real-time examination.
A limitation of this study was that the digital image examiner could grade a clinical variable based only on a single, static electronic image. Any deviations from the standard protocol (image selection, contrast, illumination), as well as limited image resolution, may translate into increased variability and thus a lower κ coefficient. In addition, the digital image examiner was unable to evaluate each subject in a real-time environment as the examiner was. The digital image examiner noted the difficulty of evaluating lid debris, a three-dimensional structure, in a two-dimensional photograph; this may explain any under-grading of this variable by the digital image examiner. In contrast to still digital images, video recordings may be more representative of grading in a real-time setting (although challenges would remain in accurately capturing data from video). For our study, however, video recording was not feasible given the electronic storage required to archive recordings for all the subjects examined. Video would have been more realistic with a smaller sample size, although a smaller sample would make it more likely that a significant κ value could not be detected or that a CI of the desired precision could not be provided. One advantage the digital image examiner has over the real-time examiner is that more time can be invested in evaluating an image, whereas the real-time examiner has far less time to evaluate and grade a clinical variable because of time constraints and potential subject fatigue. Also, there was no opportunity to re-evaluate each of the outcomes, as each subject was seen at a single visit.
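On the point about CI precision and sample size: one common way to attach a 95% CI to a weighted κ is bootstrap resampling of subject pairs, sketched below with hypothetical grades; this method is an assumption on our part, since the paper does not state how its CIs were derived. Smaller samples yield visibly wider intervals.

```python
# Hedged sketch: percentile bootstrap CI for a linearly weighted kappa.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

def bootstrap_kappa_ci(examiner, reader, n_boot=2000, alpha=0.05):
    examiner, reader = np.asarray(examiner), np.asarray(reader)
    n = len(examiner)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample subjects with replacement
        boots.append(cohen_kappa_score(examiner[idx], reader[idx],
                                       weights="linear"))
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Example with hypothetical paired grades for 399 subjects.
ex = rng.integers(0, 5, 399)
rd = np.clip(ex + rng.integers(-1, 2, 399), 0, 4)
print(bootstrap_kappa_ci(ex, rd))
```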
Multiple examiners evaluated the clinical outcomes in real time. All were trained on the standard operating procedures for biomicroscopy and meibography (subject alignment, slit illumination and width, and magnification), including the grading of clinical parameters, before examining any enrolled subject. Training included instruction on the grading scales for each variable, review of samples of each clinical variable with group interaction on assignment of the appropriate grade, and the protocol for proper image capture (lighting, field of view, focus, and magnification) and electronic storage of static images for later retrieval by the digital image examiner. One explanation for the fair unweighted κ values between the real-time and digital image examiners may lie in grading variability from one examiner to the next (systematic error), despite the training each examiner received before examining subjects in this study. Within-examiner analyses of agreement were not conducted because each subject was examined by a single examiner.
In summary, we demonstrated that inter-examiner reliability is slight-to-moderate for grading of lid margin telangiectasias and fair-to-moderate for meibomian gland dropout, compared to only slight-to-fair agreement for lid debris and acini appearance. The degree of agreement, however, may be affected by several factors, including variability within examiners, the quality of the digital images graded (substandard resolution, a field of view different from that of the real-time examiner), and grading bias by the real-time examiners owing to their knowledge of each subject's medical and ocular history (presence of meibomian gland dysfunction); the digital image examiner was not privy to any subject information beyond what the images provided. Assessment of gland dropout appears to be the best of these clinical outcomes for the assessment of MGD. Future studies investigating new grading schemes, modification of existing grading schemes, and additional clinical variables used in MGD diagnosis may be beneficial in improving inter-examiner reliability.
Andrew Emch, Anupam Laul, Kathleen Reuter, Michele Hager, and Aaron Zimmerman, of The Ohio State University College of Optometry, served as the real-time examiners. Gregory Hopkins, of The Ohio State University College of Optometry, was the image examiner. Topcon provided the BG-4M meibography slit-lamp system on loan.