Purpose
Inconsistency in inter-reader agreement presents a challenge in the diagnosis and management of retinopathy of prematurity (ROP). This study evaluates inter-reader agreement using a large web-based image repository and compares reader performance against a gold standard diagnostic algorithm.
Methods
Image sets from 716 eye exams (358 clinical visits with 3958 unique images) were uploaded to a secure online repository and independently interpreted by three readers using International Classification of Retinopathy of Prematurity criteria. A gold standard diagnosis was established for each component of the eye exam (zone, stage, presence of plus disease, presence of aggressive posterior ROP (AP-ROP), and overall diagnostic category) by combining the diagnosis selected by a majority of readers (majority diagnosis) with the clinical diagnosis from ophthalmoscopic examination. Inter-reader agreement was defined as the proportion of records for which at least 2 of 3 readers agreed and was calculated for each component of the eye exam. Absolute agreement between majority diagnosis and clinical diagnosis and between individual readers and the gold standard was also calculated for each component of the exam.
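The agreement measures defined above can be illustrated with a minimal sketch. It assumes each record holds three reader labels for a single exam component; the function names and the sample values are hypothetical and do not come from the study data. The sketch covers only majority agreement, complete concordance, and absolute agreement against a reference diagnosis. It does not attempt to reproduce the gold standard algorithm, whose combination rule is not detailed here.

```python
from collections import Counter

def majority_label(reads):
    """Return the label chosen by at least 2 of the 3 readers, or None if all differ."""
    label, count = Counter(reads).most_common(1)[0]
    return label if count >= 2 else None

def agreement_summary(records):
    """Return (inter-reader agreement, complete concordance) as proportions."""
    n = len(records)
    # Inter-reader agreement: at least 2 of 3 readers chose the same label.
    majority = sum(1 for r in records if majority_label(r) is not None)
    # Complete concordance: all 3 readers chose the same label.
    complete = sum(1 for r in records if len(set(r)) == 1)
    return majority / n, complete / n

def absolute_agreement(records, reference):
    """Proportion of records whose majority label equals the reference label."""
    matches = sum(
        1 for r, ref in zip(records, reference) if majority_label(r) == ref
    )
    return matches / len(records)

# Hypothetical stage reads for four eye exams (three readers each).
stage_reads = [(2, 2, 2), (1, 2, 2), (0, 1, 2), (3, 3, 2)]
# Hypothetical clinical (ophthalmoscopic) stage for the same exams.
clinical_stage = [2, 2, 1, 2]

inter_reader, concordance = agreement_summary(stage_reads)
vs_clinical = absolute_agreement(stage_reads, clinical_stage)
print(inter_reader, concordance, vs_clinical)
```

In this sketch, a record with no majority counts as a disagreement with the reference, which is one possible convention rather than the one the study necessarily used.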
Results
Inter-reader agreement (Table 1) ranged from 696/716 (97%) for stage to 716/716 (100%) for zone, AP-ROP, and plus disease. Complete concordance (i.e., 3 of 3 agreement) ranged from 430/716 (60%) for stage to 681/716 (95%) for AP-ROP. Absolute agreement between majority diagnosis and ophthalmoscopic diagnosis was 578/716 (81%) for zone, 507/716 (71%) for stage, 627/716 (88%) for plus disease, 650/716 (91%) for AP-ROP, and 503/716 (70%) for overall category. Absolute agreement between image readers and the gold standard (Table 2) ranged from 369/434 (85%) for stage to 428/434 (99%) for zone.
Conclusions
Overall, inter-reader agreement was high for each component of the eye exam, and agreement between the majority diagnoses and ophthalmoscopic diagnoses was moderate. Agreement between individual readers and the gold standard diagnoses was moderate to high. Absolute agreement was lowest for overall category and stage, possibly because these components have a greater number of available categorical values. These findings suggest that using more than two image readers may improve diagnostic consistency in ROP.
Keywords: 706 retinopathy of prematurity • 550 imaging/image analysis: clinical