Abstract
Purpose:
To investigate whether differences in rating scale design (question format and response categories), for items with the same content, influence item calibration. Further, to investigate whether rating scale differences lead to an overall difference in visual disability score as measured by different patient-reported outcome (PRO) instruments.
Methods:
Sixteen existing PROs suitable for cataract assessment, and with different rating scales, were self-administered by patients on a cataract surgery waiting list. Two hundred and twenty-six items measuring visual disability, in their native rating scale formats, were selected to develop a visual disability item bank. Items were calibrated on an interval-level scale in logits using Rasch analysis. Fifteen item content areas (e.g., reading newspapers, driving at night) appearing in at least three different PROs were identified. Within each content area, item calibrations were compared and their range calculated. Similarly, five PROs [Visual Disability Assessment (VDA); National Eye Institute Visual Function Questionnaire (NEIVFQ); Activities of Daily Vision Scale (ADVS); Technology of Patient Experience (TyPE); and Cataract Symptom Scale (CatScale)] having at least three items in common with the Visual Function Index (VF-14) were identified. Using these common items, the average item measures of these five PROs were compared with those of the reference PRO (VF-14).
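For context, Rasch calibration places items and persons on a common logit scale. A minimal sketch of one common formulation, the Andrich rating scale model, is given below; the specific polytomous Rasch model applied to each PRO is an assumption here and is not stated in this abstract:

\[
\ln\!\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \tau_k
\]

where \(P_{nik}\) is the probability that person \(n\) responds in category \(k\) of item \(i\), \(\theta_n\) is the person's visual ability, \(\delta_i\) is the item calibration (difficulty) in logits, and \(\tau_k\) is the threshold between categories \(k-1\) and \(k\).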
Results:
A total of 624 patients (mean ± SD age, 74.1 ± 9.4 years) participated. Items with the same content varied in their calibration by as much as two logits. Items with the content "reading the small print" had the largest range (1.99 logits), followed by "watching TV" (1.60 logits). Relative to the VF-14 (0.00 logits), the rating scale of the VDA produced the most difficult items (1.13 logits), followed by the NEIVFQ (0.66), ADVS (0.55), TyPE (0.43), and CatScale (0.24).
Conclusions:
Rasch analysis demonstrated that differences in rating scale design can have a significant effect on item calibrations beyond item content. Both question format and response category labels appear to influence item calibrations and, ultimately, the overall measurement of visual disability. This makes it difficult to compare research findings obtained with different PROs. Moreover, it would be inappropriate to use items from different PROs in their native rating scale formats in an item bank, where it is desirable that item calibration reflects item content only. A preferred strategy would be to fit all items to a common rating scale.
Keywords: quality of life • clinical (human) or epidemiologic studies: systems/equipment/techniques