Clinical and Epidemiologic Research | July 1999
Compliance with Methodological Standards When Evaluating Ophthalmic Diagnostic Tests
Author Affiliations

Robert Harper, Department of Ophthalmology, Manchester Royal Eye Hospital, United Kingdom; Barnaby Reeves, Health Services Research Unit, London School of Hygiene and Tropical Medicine, United Kingdom.
Investigative Ophthalmology & Visual Science, July 1999, Vol. 40, 1650–1657.
Abstract

Purpose. To draw attention to the importance of methodological standards when carrying out evaluations of ophthalmic diagnostic tests by reviewing the extent of compliance with these standards in reports of evaluations published within the ophthalmic literature.

Methods. Twenty published evaluations of ophthalmic screening/diagnostic tests or technologies were independently assessed by two reviewers for compliance with the following methodological standards: specification of the spectrum composition for populations used in the evaluation, analysis of pertinent subgroups, avoidance of work-up (verification) bias, avoidance of review bias, presentation of precision of results for test accuracy, presentation of indeterminate test results, and presentation of test reproducibility.

Results. Compliance ranged from just 10% (95% CI, 1%–32%) for presentation of test reproducibility data and avoidance of review bias to 70% (95% CI, 46%–88%) for avoidance of work-up bias and presentation of indeterminate test results. Only 5 of the 20 evaluations complied with four or more of the methodological standards, and none complied with more than five.

Conclusions. The evaluations of ophthalmic diagnostic tests discussed in this article show limited compliance with accepted methodological standards but are no worse than previously described for evaluations published in general medical journals. Adherence to these standards by researchers can improve the study design and reporting of evaluations of new diagnostic techniques. Limited compliance, combined with a lack of awareness of the standards among users of research evidence, may lead to the inappropriate adoption of new diagnostic technologies, with a consequent waste of health care resources.

Ophthalmic diagnostic tests help the clinician to make a diagnosis, assess the severity and prognosis of disease, and choose appropriate treatments. Both new and established technologies have been advocated to investigate patients in ophthalmic practice. Before using these technologies to guide clinical decisions, however, clinicians must know how “good” the tests are. Unfortunately, diagnostic tests are frequently not evaluated rigorously before they are made available to clinicians, and once implemented, their performance is sometimes disappointing. For example, the diagnostic value of clinical contrast sensitivity tests in ophthalmic practice has been limited, 1 despite considerable enthusiasm when commercial tests were first made available. 2 3 4  
The performance of a diagnostic test is often referred to as “diagnostic accuracy” (i.e., the extent to which the result of a particular test correctly classifies patients into predefined disease categories). Diagnostic accuracy is usually characterized by the sensitivity and specificity of a test. Although likelihood ratios are considered to be the key indices for making clinical decisions about patients, because they provide an explicit tool for revising diagnostic probabilities according to test outcomes, 5 it is the sensitivities and specificities that are most commonly presented when the diagnostic accuracy of tests is reported.
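To make these indices concrete, the following minimal sketch (in Python, with hypothetical counts that are not drawn from any of the studies reviewed) computes sensitivity, specificity, and both likelihood ratios from a 2 × 2 table of test results against a gold standard.

```python
# Hypothetical illustration: none of these counts come from the studies reviewed.

def diagnostic_indices(tp, fp, fn, tn):
    """Sensitivity, specificity, and likelihood ratios from a 2x2 table."""
    sensitivity = tp / (tp + fn)              # P(test positive | disease)
    specificity = tn / (tn + fp)              # P(test negative | no disease)
    lr_pos = sensitivity / (1 - specificity)  # factor by which a positive result raises the odds of disease
    lr_neg = (1 - sensitivity) / specificity  # factor by which a negative result lowers the odds of disease
    return sensitivity, specificity, lr_pos, lr_neg

# 80 diseased subjects (64 test positive) and 120 normal subjects (108 test negative):
sens, spec, lr_pos, lr_neg = diagnostic_indices(tp=64, fp=12, fn=16, tn=108)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, LR+={lr_pos:.1f}, LR-={lr_neg:.2f}")
# sensitivity=0.80, specificity=0.90, LR+=8.0, LR-=0.22
```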
Evaluations of diagnostic tests should comply with accepted standards to provide clinically relevant estimates of diagnostic accuracy. However, there is limited compliance with such standards within the general medical literature. 6 The purposes of this article are, first, to draw attention to the need for researchers to comply with the standards when performing evaluations of diagnostic accuracy and for practitioners to appraise published reports of diagnostic accuracy against the standards before implementing tests in clinical practice, and, second, to review published reports of the diagnostic accuracy of ophthalmic tests, both to illustrate the importance of the standards and to estimate the extent of compliance with them. Within the context of this article, screening tests are also included, because evaluations of their performance should comply with the same criteria as those used for the evaluation of diagnostic tests.
Methods
Selection of Evaluation Studies
Studies were eligible for review if indices of diagnostic accuracy were reported, the test under evaluation was intended for clinical application, and the findings were reported in peer-reviewed ophthalmic or general medical journals. 
Twenty published evaluations of both established and new ophthalmic tests were selected. Eleven evaluations of diagnostic or screening tests for glaucoma were chosen, spanning structural (clinical examination and imaging of the optic nerve head), physiological (intraocular pressure [IOP] and pattern electroretinogram), and psychophysical (perimetry, contrast sensitivity) tests. To illustrate the wider applicability of the standards to a range of ophthalmic conditions, a MEDLINE database search was performed using a recommended strategy. 5 A further nine studies were selected from this process, including imaging and photographic, visual function, and laboratory tests. The studies reviewed are drawn predominantly from recent publications (12 were published between 1995 and 1997, 5 between 1990 and 1994, and 3 before 1990). In each case, selection of the studies was made before assessment against the standards.
Assessment of Compliance with Standards
All articles were independently assessed by two reviewers for compliance with seven widely accepted methodological standards. 6 7 8 9 10 11 Reviewers used the definitions described by Reid et al. 6 These standards are summarized in Table 1 . Overall agreement between the reviewers was 80% (range, 55%–100%; see Table 2 ). All instances of disagreement (28 of 140) were resolved by discussion. The majority occurred for three standards (21 disagreements for standards 2, 4, and 6); these disagreements are considered further in the discussion. 
In addition to overall percent agreement, kappa values were calculated to measure the agreement between the reviewers after taking account of agreement expected by chance; kappa values ranged from −0.24 to 1.0 across the criteria (Table 2). Kappa values appear inconsistent with percent agreement in some cases because of the uniformity of the responses across evaluations for some criteria (e.g., standards 4 and 7). When responses are relatively uniform across the examples rated (e.g., standard 7, in which 18 of the 20 evaluations failed to meet the criterion), it is very difficult to obtain a high kappa score, because “expected” agreement (calculated from the marginal totals) is also high. Standard 5, for which responses were also relatively uniform, achieved a kappa of 1.00 only because there was 100% agreement.
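The effect of uniform responses on kappa can be reproduced with a short sketch (hypothetical ratings, not the reviewers' actual scores): with 18 of 20 evaluations failing a standard and the two reviewers disagreeing on the remaining 2, raw agreement is 90% yet kappa is negative, much as for standard 7 in Table 2.

```python
# Hypothetical ratings for 20 evaluations (1 = compliant, 0 = not), not the actual data.

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n                        # marginal "yes" proportion, rater A
    p_b = sum(rater_b) / n                        # marginal "yes" proportion, rater B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement from the marginals
    return (observed - expected) / (1 - expected)

a = [0] * 18 + [1, 0]   # rater A scores one evaluation as compliant
b = [0] * 18 + [0, 1]   # rater B scores a different one as compliant
observed = sum(x == y for x, y in zip(a, b)) / len(a)
print(f"agreement={observed:.0%}, kappa={cohens_kappa(a, b):.2f}")
# agreement=90%, kappa=-0.05: chance agreement is already ~0.9, so kappa collapses.
```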
Results
Table 3 summarizes the extent of compliance of the 20 studies reviewed 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 with the seven standards, after disagreements between reviewers had been resolved. 
Standard 1: Specification of Spectrum Composition
Twelve of the 20 studies (60%; 95% CI, 36%–81%) complied with this standard. Characterizing the study population is important, because the sensitivity and specificity of a test can be markedly influenced by the demographic and clinical composition of the population studied (e.g., age, sex, ethnicity, disease severity, comorbidity). Although the standard merely requires that the spectrum composition be specified, it is important because it allows the reader to judge whether the estimates of diagnostic accuracy reported by an evaluation can be applied to the population in which the reader wants to use the test (see the Discussion section).
Standard 2: Analysis of Pertinent Subgroups
This standard was met by 11 of the 20 evaluations (55%; 95% CI, 32%–77%). The second standard is closely linked with the first, because it concerns the way in which the sensitivity and specificity of a test can vary for different subgroups in a population. If the population studied includes people with wide-ranging characteristics (e.g., all ages), overall estimates of sensitivity and specificity may disguise considerable variations in performance in different subgroups. 12 Thus, even if overall performance is disappointing, a test may perform well in a subgroup; alternatively, overall performance may appear to be good, but may be unacceptably poor in a minority of subjects. It is important to point out, however, that this standard is intended to encourage the reporting of diagnostic accuracy in clinically relevant subgroups; it is not appropriate to search for “good” performance in a subgroup without a priori justification. 
An early evaluation of oculokinetic perimetry estimated sensitivity to be more than 80% for the detection of glaucomatous visual field defects. 15 Subsequent evaluations of this form of perimetry, which included patients with a range of visual field loss, indicated much lower sensitivity for early visual field loss. 20 29  
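A hypothetical calculation (invented counts, echoing the perimetry example above) shows how a pooled estimate can mask poor performance in the clinically crucial early-loss subgroup:

```python
# Invented counts: the test detects 12 of 30 early defects but 57 of 60 advanced ones.
early_tp, early_fn = 12, 18
advanced_tp, advanced_fn = 57, 3

overall_sens = (early_tp + advanced_tp) / (early_tp + early_fn + advanced_tp + advanced_fn)
print(f"overall sensitivity:  {overall_sens:.2f}")                               # 0.77
print(f"early field loss:     {early_tp / (early_tp + early_fn):.2f}")           # 0.40
print(f"advanced field loss:  {advanced_tp / (advanced_tp + advanced_fn):.2f}")  # 0.95
```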
Standard 3: Avoidance of Work-up, or Verification, Bias
Fourteen of the 20 evaluations (70%; 95% CI, 46%–88%) complied with this standard. This bias is introduced when subjects with positive or negative diagnostic test results are selectively referred to receive verification by the validating criterion (i.e., the gold standard), or when the groups of “diseased” and “normal” subjects (based on the gold standard) have been selected according to some clinical factor relating to the disease.
A population-based study of glaucoma screening 24 30 provides an example of work-up bias. More than 5000 subjects were screened using a test battery, and diagnostic accuracies for optic disc assessment, IOP, and field screening were reported. However, only those who “failed” one or more of the tests under evaluation were referred for a “definitive ophthalmologic examination.” Referred subjects were classified as having or not having glaucoma by this examination (i.e., the gold standard), but the diagnosis of glaucoma could be made only in those subjects who were referred. Thus, there are likely to have been a small number of truly glaucomatous patients who “passed” all screening tests and whose condition was not detected because they were not referred for the gold standard. If all patients had had the definitive examination, the results in these patients would have been classified as false negatives rather than true negatives, suggesting that the reported sensitivity estimate may be biased upward, and the specificity estimate downward (see the Discussion section).
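A hypothetical worked example (invented numbers, not the Baltimore data) makes the direction of the sensitivity bias explicit: diseased subjects who pass every screening test are never verified, so they cannot appear as false negatives.

```python
# Invented screened population of diseased subjects.
tp = 45   # diseased, screen positive -> referred and verified by the gold standard
fn = 15   # diseased, screen negative -> never referred, so never diagnosed

true_sensitivity = tp / (tp + fn)   # 0.75: what full verification would reveal
biased_sensitivity = tp / (tp + 0)  # 1.00: the 15 missed cases are invisible to the study

print(f"with full verification: {true_sensitivity:.2f}")
print(f"with work-up bias:      {biased_sensitivity:.2f}")
```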
Standard 4: Avoidance of Review, or Expectation, Bias
Only 2 of the 20 evaluations (10%; 95% CI, 1%–32%) complied with this standard. Bias can be introduced if the results of the test under evaluation are interpreted with knowledge of the results of the gold standard (or vice versa).
Studies by Xu et al. 31 and Bjerrum 14 reported the sensitivity and specificity of diagnostic tests for dry eye (keratoconjunctivitis sicca) in patients with primary Sjögren’s syndrome and other connective tissue diseases. However, it is not clear whether the clinician evaluating the tests was masked with respect to the validation status of the subjects. 
Standard 5: Precision of Results for Test Accuracy
Only 3 of the 20 evaluations (15%; 95% CI, 3%–38%) complied with this standard. If sensitivity and specificity estimates are reported without a measure of precision, clinicians cannot know the range within which the true values of sensitivity and specificity may lie. For example, the sensitivity estimate of 73% for a laboratory test for ocular sarcoidosis 26 based on only 22 patients has a 95% CI ranging from 54% to 92%. In contrast, the specificity estimate of 83% has better precision (95% CI, 74%–92%), a reflection of both the higher point estimate and the larger sample size used by the researchers for their nonsarcoid group (n = 70). (Note: the formula for the SE of a proportion, \(\sqrt{pq/n}\), is based on a normal approximation to the binomial distribution and can be used to calculate 95% CIs for sensitivity and specificity: \(p \pm 1.96\sqrt{pq/n}\), where p represents either sensitivity or specificity, q = 1 − p, and n is the sample size for either sensitivity or specificity. When n × p or n × q is less than 5, the validity of the approximation becomes doubtful, and exact methods should be used to calculate the 95% CI [see Fig. 1 ].)
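Both intervals mentioned in the note can be computed in a few lines; the sketch below (assuming SciPy is available for the exact beta quantiles) reproduces the sarcoidosis example of 16 of 22 patients testing positive.

```python
from math import sqrt
from scipy.stats import beta  # assumed dependency, used only for the exact interval

def wald_ci(k, n, z=1.96):
    """Normal-approximation 95% CI: p +/- 1.96*sqrt(pq/n)."""
    p = k / n
    se = sqrt(p * (1 - p) / n)
    return max(0.0, p - z * se), min(1.0, p + z * se)

def exact_ci(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) binomial 95% CI, for use when n*p or n*q < 5."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

# Sensitivity of roughly 73% from 16 of 22 sarcoidosis patients, as in the text:
print(wald_ci(16, 22))   # approximately (0.54, 0.91)
print(exact_ci(16, 22))  # exact interval, preferable for small samples
```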
There is widespread use of statistical CIs in the ophthalmic literature, and journals usually require CIs to be specified for descriptive estimates and analytic comparisons. 32 However, the same journals seem less vigilant for evaluations of diagnostic accuracy. 
Standard 6: Presentation of Indeterminate Test Results
Fourteen of the 20 evaluations (70%; 95% CI, 46%–88%) complied with this standard. For a variety of reasons, tests occasionally yield indeterminate results. For example, patient cooperation may be limited, or the presence of media opacities may obscure fundus observation. Knowledge of the percentage of indeterminate results is important in deciding how applicable a test may be to a population of interest. The way in which indeterminate results are classified (i.e., as positive or negative, or by excluding them altogether 33 ) affects the estimate of diagnostic accuracy.
A study to determine the effectiveness of videorefraction in screening for significant refractive errors in infants reported a sensitivity of 84% and a specificity of 91% for hyperopia of more than 4.00 D when cycloplegic videorefraction was used. 22 However, limited cooperation, failure to obtain adequate blur circles, and difficulty in obtaining adequate cycloplegia restricted the final sample size. The prevalence of these indeterminate results was not reported. Because this test is intended to be used for screening, the “untestable” children should arguably have been included and their test results regarded as positive (i.e., requiring further investigation), thereby decreasing the specificity of the test.
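A hypothetical recalculation (invented counts chosen to match the reported 91% specificity) shows the size of the effect argued for above when untestable children are counted as screen positive:

```python
# Invented counts: 200 normal infants with usable results, 25 untestable normal infants.
tn, fp = 182, 18          # usable results among normals: specificity 182/200 = 0.91
untestable_normals = 25   # no usable result obtained

spec_excluded = tn / (tn + fp)
# Screening logic: an untestable child needs further investigation, i.e. counts as positive.
spec_untestable_positive = tn / (tn + fp + untestable_normals)

print(f"indeterminates excluded:         {spec_excluded:.2f}")              # 0.91
print(f"indeterminates counted positive: {spec_untestable_positive:.2f}")   # 0.81
```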
Standard 7: Test Reproducibility
Only 2 of the 20 evaluations (10%; 95% CI, 1%–32%) complied with this standard. Limited reproducibility is inevitably reflected in the sensitivity and specificity estimates, and so these estimates provide valid measures of performance that take into account the degree of reproducibility. However, reporting test reproducibility is important to allow the reader to appraise whether the same level of reproducibility can be obtained in the study setting, particularly when a test result is based on expert judgment. For example, experts or proponents of a new test might be expected to be able to apply a new grading system with better reproducibility than nonexperts. Therefore, evidence about test reproducibility should clarify whether the test results are reproducible in “average” or “expert” hands. 
Discussion
The seven quality standards described for evaluations of diagnostic tests have a role comparable to that of more familiar quality standards for randomized controlled trials (RCTs), 34 which have led to an improvement in the design of RCTs and better critical appraisal by readers of published reports. 
A review of diagnostic evaluations published in four prestigious general medical journals between 1978 and 1993 6 found widespread noncompliance with these standards, with compliance exceeding 50% for only one of the seven standards. For the present review, compliance ranged from 10% for presentation of test reproducibility data and avoidance of review bias to 70% for avoidance of work-up bias (the standard for which compliance was also highest in the review by Reid et al. 6 ) and presentation of indeterminate results. Overall, only 25% of studies complied with four or more of the standards, a proportion comparable to that found by Reid et al. 6 for recently published studies.
We acknowledge that our findings from 20 selected evaluations may not be representative of the ophthalmic literature in general. We selected our sample to include a high proportion of evaluations of glaucoma tests, albeit tests of varying modality, because of the significance of glaucoma in ophthalmology, the need to detect the disease at an early, asymptomatic stage, and the considerable research effort devoted to evaluating screening and diagnostic methods. In addition, two of the evaluations we reviewed were from the Baltimore Eye Survey, 24 30 and these cannot be considered to be independent examples.
Despite the selected nature of our sample, we believe that our findings for compliance, although somewhat imprecise, are nevertheless likely to be reasonably representative of compliance within ophthalmology, for four reasons. First, our findings are similar to those reported by Reid et al. 6 for publications in high-ranking general medical journals. Second, we recently conducted a systematic review for one of the standards, which found similarly limited compliance. 35 Third, all reports we reviewed were published in peer-reviewed journals, with more than half of them published in high-ranking ophthalmic or general medical journals. Finally, selection of studies was made before assessment of compliance.
The limited agreement between the reviewers when assessing standards 2, 4, and 6 may cast doubt on the value of these particular standards. The particular problem with standard 4 related to the interpretation of review bias in instances in which the diagnostic test or validating criterion was automated. These disagreements would not have occurred with a strict interpretation of the standards, and we suggest that this standard should be clarified (see discussion later). Disagreements on standards 2 and 6 arose primarily from a lack of clarity in some articles or the failure by reviewers to identify relevant information, rather than from ambiguity of the standards themselves. These problems would be unlikely to arise if reporting of compliance with the standards were mandatory, as it is for RCTs. 32
Applicability of the Methodological Standards in Ophthalmology
The level of compliance that we found would usually cast doubt on the relevance of the reported findings to clinical practice. However, it is important to consider the applicability of the standards within ophthalmology, because some of the standards may be less important for some of the tests included in the review. 
First, it might be argued that the standard for test reproducibility is less important for automated tests when the classification of the test result does not depend on expert judgment. All sources of measurement error are reflected in the estimates of sensitivity and specificity. The crucial difference is that when variation in the expertise of observers is not a source of measurement error, the estimates are more likely to be generalizable across settings. 
Second, it might be argued that applying the standard for avoidance of review bias is unnecessary if a test is entirely automated, on the assumption that a test result obtained automatically cannot be biased through interpretation by a clinician. However, if an automated diagnostic test is used before validation, 22 review bias can still occur because the diagnostic test may influence interpretation of the gold standard; it is only when an automated diagnostic test is used after validation 17 that review bias is likely to be avoided. Because of the possibility of an operator influencing an automated test in a subtle way (e.g., in the setting up of test parameters or in the interpretation of the result), we recommend that researchers maintain and report the independence of the test and gold standard procedures, even when one or the other appears to be completely independent of the operator. The standard on avoidance of review bias, as described by Reid et al., 6 does not discuss how automated tests should be judged. Consequently, it was perhaps not surprising that the main source of disagreement between reviewers was interpretation of this standard when a diagnostic test was automated or semiautomated. 
Finally, it might be argued that if the main intended application of a screening or diagnostic test is to a specific population only (e.g., stereoacuity tests or video refraction in infants), then it is not realistic to expect the reporting of indices of accuracy for further subgroups. 
Additional Standards
In addition to the seven accepted standards described here, we believe that there are three further principles that researchers should adhere to: There should be a clear definition of the gold standard, the gold standard should be independent of the test under evaluation, and the population studied should be appropriate for the intended application of the test. 
Definition of the Gold Standard.
We believe that the gold standard should always be clearly defined, even though there may be some overlap between this requirement and standard 3 (i.e., work-up bias). This requirement is particularly important in situations in which it is impracticable or unethical to administer the gold standard to all patients. The overlap between definition of the gold standard and work-up bias is demonstrated in the study by Tielsch et al. 30 The gold standard for this evaluation was described as a definitive ophthalmologic examination, and, consequently, we scored this evaluation as failing to avoid work-up bias, because only those who failed at least one of the screening tests were referred for this examination. It appears that the researchers regarded the referral of all subjects for a definitive examination as impracticable, a not unreasonable decision. However, the implication of this decision for the gold standard was not spelled out, namely, that the gold standard definition of normality should have become “no disease found on definitive examination or passed all screening tests.” We believe that such a statement clarifies the true gold standard definition and highlights the possibility of work-up bias in a way in which the original article did not.
Work-up bias may be unavoidable if the gold standard carries a health risk, making it unethical to administer the gold standard to all subjects (e.g., highly invasive tests). In such cases, the duty of the researchers is to make explicit the validating criterion for normality in the absence of the gold standard; this may be demonstration of normality on a battery of tests, or the continuing absence of disease (demonstrated by whatever means) over a prolonged period of follow-up. 
Work-up bias is difficult to avoid in the context of the evaluation of screening tests in which the prior probability of disease is usually very low. A practicable evaluation must either make assumptions about the normality of those who pass the screening test, as discussed above, or select a population for the evaluation that contains a much higher proportion of diseased people than would be expected when screening (e.g., by choosing equal numbers of definitively normal people and people who have been newly referred for investigation). Selection of this kind almost inevitably results in work-up bias, because the reasons for referral are likely to be associated with the results of the screening test. 36  
Independence of the Gold Standard.
The gold standard should be independent of the diagnostic test under evaluation; that is, the test under evaluation should not be performed as part of the gold standard. This requirement should hold even when the objective of the evaluation is to investigate the decrease in diagnostic accuracy when one or more elements of the gold standard are omitted. This problem is illustrated by an evaluation of the sensitivity and specificity of a 26-point screening program on the Henson field screener. 21 The points tested by the screening program are also tested during the extended program, which was used as the gold standard criterion. The investigators simply calculated the diagnostic accuracy for the screening program by extracting the data from the extended test, rather than by performing the screening and extended tests on separate occasions. This procedure eliminates variability between screening and extended tests that would occur in practice. (In fact, a subsequent evaluation using an independent validating criterion has confirmed high sensitivity and specificity for this particular form of field screening. 29 ) This criticism may also apply to Bjerrum, 14 who appears to have performed two of the tests under evaluation as part of the gold standard examination for diagnosis of dry eye.
It is often the case that the results of the test under evaluation and the gold standard are highly correlated, not because of the problem just described, but because the test and the gold standard are measuring similar underlying properties (e.g., aspects of visual function in glaucoma). It is not surprising, therefore, that a field test has higher diagnostic accuracy than IOP or cup-to-disc ratio when screening for glaucoma, when the gold standard includes a definitive perimetric examination. 16 24 30 In these circumstances, the evaluation is not invalid, but it is important to be aware of the inherent tautology. 
Appropriateness of the Study Population.
It is difficult to recommend including the third additional standard as a true standard because of the subjectivity of judging appropriateness. The appropriateness of the population included in the evaluation has previously been mentioned in relation to the standard on specification of spectrum composition. This point can be graphically illustrated by comparing two datasets on the diagnostic accuracy of IOP, shown as receiver operating characteristic curves in Figure 2 . 16 30 These curves suggest a considerable difference in diagnostic accuracy, with curve B indicating that IOP is a much better test. 
Closer inspection of the two curves suggests that they differ primarily with respect to the specificity estimates, because both curves have comparable sensitivity estimates (e.g., <50% for an IOP >22 mm Hg). This finding is consistent with other epidemiologic studies of glaucoma. 37 38 However, the specificity estimates at this level of IOP vary from approximately 75% in curve A to more than 90% in curve B. Because specificity estimates are derived from the normal subjects in the sample under evaluation, this discrepancy must be attributable to differences between the studies in the populations of normal subjects. Daubs and Crick 16 evaluated glaucoma diagnostic tests in a hospital population. The normal subjects did not have field loss but had been referred to King’s College Hospital as “suspects” (Crick, personal communication). Because “suspects” are likely to have included a higher proportion of people with raised IOP (a common indicator for referral), the specificity estimates are lower than expected. In the population-based study of Tielsch et al., 30 the specificity of more than 90% for an IOP cutoff criterion of more than 22 mm Hg is more consistent with the known prevalence of ocular hypertension. 39 Consequently, their data are more representative of the performance of tonometry for screening.
In considering appropriateness, it is also important to highlight the selective nature of populations in some evaluations. Evaluations of the diagnostic accuracy of glaucoma tests often use the results from the Humphrey Visual Field Analyzer (San Leandro, CA) as the gold standard. Sometimes subjects are selected to have prior experience of automated perimetry 18 23 or subjects with unreliable test results are excluded. 29 Selecting subjects in this way is likely to result in inflated estimates of diagnostic accuracy and gross underestimation of indeterminate results. The prevalence of unreliable subjects is not insignificant and has been estimated to be as high as 45% in glaucomatous subjects and 30% in control subjects. 40  
A study to evaluate an artificial neural network for the automatic detection of diabetic retinopathy from fundus images provides another example of the selective nature of a study population. 17 The sample used to test the system comprised 200 diabetic fundus images and 101 normal fundus images. The researchers concluded that the system could be used as an aid to the screening of diabetic patients for retinopathy. However, given that the normal fundus images do not appear to have included nondiabetic lesions (e.g., age-related maculopathy), the specificity of the system in a screening setting is likely to be worse than reported. It may be appropriate to use selected populations for the preliminary evaluation of a system, but any conclusion about wider application requires an evaluation on a representative population.
In conclusion, this article has highlighted the importance of complying with methodological standards when evaluating ophthalmic diagnostic tests. Our findings emphasize the need for researchers to comply with the standards, so that published estimates of diagnostic accuracy are relevant to clinical practice, and for practitioners to critically appraise evaluations of diagnostic tests against the standards, to avoid being misled by biased (and sometimes overoptimistic) results.
Improved diagnostic accuracy is only one of many steps toward effective treatment, and the use of rigorously evaluated tests cannot guarantee better patient outcomes. 5 However, patient care can be expected to improve if ineffective diagnostic tests are avoided, because the widespread use of tests with limited accuracy can have serious health and financial consequences. Ideally, diagnostic tests that show promising accuracy should be subjected to RCTs to determine whether the test results in improved health outcomes. 41  
 
Table 1. Methodological Standards for the Evaluation of Diagnostic Tests
(1) Specification of spectrum composition
This standard requires at least three of the following four descriptors to be reported for the study population: the age and sex distribution, the initial clinical symptoms and/or disease stage of the populations studied, and the eligibility criteria for the subjects included.
(2) Analysis of pertinent subgroups
This standard requires the evaluation to cite the indices of accuracy for any pertinent demographic or clinical subgroup of the population.
(3) Avoidance of work-up (verification) bias
This standard requires an evaluation to assign all subjects to receive both diagnostic testing and gold standard verification.
(4) Avoidance of review bias
This standard requires an evaluation to make a clear statement about the independence in interpreting both the test and the gold standard procedure.
(5) Presentation of precision of results for test accuracy
This standard requires an evaluation to report the 95% CI or SE associated with the indices of diagnostic accuracy.
(6) Presentation of indeterminate test results
This standard requires an evaluation to state the number of indeterminate results and whether these results had been included or excluded when the indices of accuracy were calculated.
(7) Presentation of test reproducibility
This standard requires that the reproducibility of a test be reported or that the report cite other sources of this information.
Table 2. Agreement between Reviewers for Each of the Seven Standards, Expressed in Terms of Percentage Agreement and the Kappa Statistic
Standard Percentage Agreement Kappa Statistic
1. Specification of spectrum composition 85% 0.69
2. Analysis of pertinent subgroups 65% 0.25
3. Avoidance of work-up bias 90% 0.76
4. Avoidance of review bias 55% −0.24*
5. Precision of results for test accuracy 100% 1.00
6. Presentation of indeterminate test results 75% 0.47
7. Test reproducibility 90% −0.05*
Table 3. Compliance of Evaluation Studies with the Seven Methodological Standards (Table 1)
Study Nature of Evaluation Methodological Standard
1* 2 3 4, † 5 6 7* Total
Ariyasu et al. 12 Vision tests for sight-threatening eye disease Yes Yes Yes No Yes No No 4
Birch et al. 13 Stereoacuity tests in preschool children No No Yes Yes No No No 2
Bjerrum 14 Dry eye tests Yes Yes No No No No No 2
Damato et al. 15 Oculokinetic perimetry for glaucomatous field loss Yes Yes Yes No No Yes No 4
Daubs and Crick 16 Range of glaucoma screening tests No No No No No No No 0
Gardner et al. 17 Automatic detection of diabetic retinopathy No No Yes Yes No Yes No 3
Graham et al. 18 Psychophysical/electrophysiological tests in glaucoma Yes No Yes No No Yes No 3
Harding et al. 19 Photography/ophthalmoscopy in diabetic retinopathy No Yes Yes No Yes Yes No 4
Harper et al. 20 Oculokinetic perimetry for glaucomatous field loss No Yes No No Yes Yes No 3
Henson and Bryson 21 Suprathreshold visual field screening No No Yes No No Yes No 2
Hodi 22 Videorefraction for refractive errors in infants No Yes Yes No No No No 2
Johnson and Samuels 23 Frequency-doubling perimetry for glaucoma Yes Yes Yes No No Yes Yes 5
Katz et al. 24 Suprathreshold field screening in glaucoma Yes No No No No Yes No 2
Mikelberg et al. 25 Heidelberg optic nerve head imaging in glaucoma Yes No Yes No No Yes No 3
Power et al. 26 Laboratory/scanning tests for ocular sarcoidosis Yes No Yes No No Yes No 3
Smolek and Klyce 27 Videokeratography in keratoconus No Yes Yes No No Yes No 3
Sommer et al. 28 Automated threshold visual field testing in glaucoma Yes No Yes No No No No 2
Sponsel et al. 29 Visual field screening in glaucoma Yes Yes No No No Yes No 3
Tielsch et al. 30 Range of glaucoma screening tests Yes Yes No No No Yes Yes 4
Xu et al. 31 Dry eye tests Yes Yes Yes No No Yes No 4
Figure 1. Illustration of the breadth of exact binomial 95% CIs as a function of the sample estimate of the proportion of interest and sample size. From outside to center, the pairs of lines represent sample sizes of 20, 40, 60, 100, 200, and 500. Note that the 95% CI is at its widest for a proportion equal to 0.5 and narrows as the proportion tends to 0 or 1. To use the graph, read off the upper and lower 95% CIs and simply add and subtract the sample estimate; for example, a sample estimate of 0.5 (i.e., a sensitivity or specificity of 50%), based on a sample size of 100, has a 95% CI that ranges from 0.4 to 0.6.
Figure 2. Receiver operating characteristic curves for tonometry, drawn from the data of Daubs and Crick 16 (curve A, open circles) and Tielsch et al. 30 (curve B, closed circles). The data points represent the sensitivity/specificity at different levels of IOP (in millimeters of mercury).
Moseley MJ, Hill AR. Contrast sensitivity testing in clinical practice. Br J Ophthalmol. 1994;78:795–797. [CrossRef] [PubMed]
Arden GB, Jacobson JJ. A simple grating test for contrast sensitivity: preliminary results indicate value for screening in glaucoma. Invest Ophthalmol Vis Sci. 1978;17:23–32. [PubMed]
Ginsberg AP. A new contrast sensitivity vision test chart. Am J Optom Physiol Opt. 1984;61:403–407. [CrossRef] [PubMed]
Della Sala S, Bertoni G, Somazzi L, Stubbe F, Wilkins AJ. Impaired contrast sensitivity in diabetic patients with and without retinopathy: a new technique for rapid assessment. Br J Ophthalmol. 1985;69:136–142. [CrossRef] [PubMed]
Deeks JJ, Morris JM. Evaluating diagnostic tests. Obstet Gynaecol. In press.
Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research: getting better but still not good. JAMA. 1995;274:645–651. [CrossRef] [PubMed]
Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med. 1978;299:926–930. [CrossRef] [PubMed]
Cooper LS, Chalmers TC, McCally M, Berrier J, Sacks HS. The poor quality of early evaluations of magnetic resonance imaging. JAMA. 1988;259:3277–3280. [CrossRef] [PubMed]
Arroll B, Schechter MT, Sheps SB. The assessment of diagnostic tests: a comparison of medical literature in 1982 and 1985. J Gen Intern Med. 1988;3:443–447. [CrossRef] [PubMed]
Jaeschke R, Guyatt GH, Sackett DL, for the Evidence-Based Medicine Working Group. Users’ guides to the medical literature, III: how to use an article about a diagnostic test, A: are the results of the study valid? JAMA. 1994;271:389–391. [CrossRef] [PubMed]
Jaeschke R, Guyatt GH, Sackett DL, for the Evidence-Based Medicine Working Group. Users’ guides to the medical literature, III: how to use an article about a diagnostic test, B: what are the results and will they help me in caring for patients? JAMA. 1994;271:703–707. [CrossRef] [PubMed]
Ariyasu RG, Lee PP, Linton KP, LaBree LD, Azen SP, Siu AL. Sensitivity, specificity, and predictive values of screening tests for eye conditions in a clinic-based population. Ophthalmology. 1996;103:1751–1760. [CrossRef] [PubMed]
Birch E, Williams C, Hunter J, Lapa MC. Random dot stereoacuity in preschool children. J Pediatr Ophthalmol Strabismus. 1997;34:217–222.
Bjerrum KB. Test and symptoms in keratoconjunctivitis sicca and their correlation. Acta Ophthalmologica. 1996;74:436–441.
Damato BE, Chyla J, McClure E, Jay JL, Allan JD. A hand-held OKP chart for the screening of glaucoma: preliminary evaluation. Eye. 1990;4:632–637. [CrossRef] [PubMed]
Daubs J, Crick RP. Epidemiological analysis of the King’s College Hospital glaucoma data. Res Clin Forums. 1980;2:41–59.
Gardner GG, Keating D, Williamson TH, Elliott AT. Automatic detection of diabetic retinopathy using an artificial neural network: a screening tool. Br J Ophthalmol. 1996;80:940–944. [CrossRef] [PubMed]
Graham SL, Drance SM, Chauhan BC, et al. Comparison of psychophysical and electrophysiological testing in early glaucoma. Invest Ophthalmol Vis Sci. 1996;37:2651–2662. [PubMed]
Harding SP, Broadbent DM, Neoh C, White MC, Vora J. Sensitivity and specificity of photography and direct ophthalmoscopy in screening for sight threatening eye disease: the Liverpool diabetic eye study. BMJ. 1995;311:1131–1135. [CrossRef] [PubMed]
Harper RA, Hill AR, Reeves BC. Effectiveness of unsupervised oculokinetic perimetry for detecting glaucomatous visual field defects. Ophthalmic Physiol Opt. 1994;14:199–202. [CrossRef]
Henson DB, Bryson H. Clinical results with the Henson–Hamblin CFS 2000. In: Greve EL, Heijl A, eds. Seventh International Visual Field Symposium. Dordrecht: Dr W. Junk; 1987:233–238.
Hodi S. Screening of infants for significant refractive error using videorefraction. Ophthalmic Physiol Opt. 1994;14:310–313. [CrossRef]
Johnson CA, Samuels SJ. Screening for glaucomatous visual field loss with frequency doubling perimetry. Invest Ophthalmol Vis Sci. 1997;38:413–425. [PubMed]
Katz J, Tielsch JM, Quigley HA, Javitt J, Witt K, Sommer A. Automated suprathreshold screening for glaucoma: the Baltimore Eye Survey. Invest Ophthalmol Vis Sci. 1993;34:3271–3277. [PubMed]
Mikelberg FS, Parfitt CM, Swindale SL, Graham SL, Drance SM, Gosine R. Ability of the Heidelberg Retina Tomograph to detect early glaucomatous visual field loss. J Glaucoma. 1995;4:242–247. [PubMed]
Power WJ, Neves RA, Rodriguez A, Pedroza–Seres M, Foster CS. The value of combined serum angiotensin-converting enzyme and gallium scan in diagnosing ocular sarcoidosis. Ophthalmology. 1995;102:2007–2011. [CrossRef] [PubMed]
Smolek MK, Klyce SD. Current keratoconus detection methods compared with a neural network approach. Invest Ophthalmol Vis Sci. 1997;38:2290–2299. [PubMed]
Sommer A, Enger C, Witt K. Screening for glaucomatous visual field loss with automated threshold perimetry. Am J Ophthalmol. 1987;103:681–684.
Sponsel WE, Ritch R, Stamper R, et al. Prevent blindness America visual field screening study. Am J Ophthalmol. 1995;120:699–708. [CrossRef] [PubMed]
Tielsch JM, Katz J, Singh K, et al. A population-based evaluation of glaucoma screening: The Baltimore Eye Survey. Am J Epidemiol. 1991;134:1102–1110. [PubMed]
Xu KP, Yagi Y, Toda I, Tsubota K. Tear function index: a new measure of dry eye. Arch Ophthalmol. 1995;113:84–88. [CrossRef] [PubMed]
Begg C, Cho M, Eastwood S, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA. 1996;276:637–639. [CrossRef] [PubMed]
Simel DL, Feussner JR, Delong ER, Matchar DB. Intermediate, indeterminate and uninterpretable diagnostic test results. Med Decis Making. 1987;7:107–114. [CrossRef] [PubMed]
Moher D, Jadad AR, Nichol G, et al. Assessing the quality of randomised controlled trials: an annotated bibliography of checklists. Control Clin Trials. 1995;16:62–73. [CrossRef] [PubMed]
Harper R, Reeves B. Reporting of precision for estimates of diagnostic accuracy: a review. BMJ. 1999;318:1322–1323. [PubMed]
Harper RA, Reeves BC. Glaucoma screening: the importance of combined test data. Optom Vis Sci. 1999;76:537–543.
Leibowitz HM, Krueger DE, Maunder LR, et al. The Framingham Eye Study monograph. Surv Ophthalmol. 1980;24(Suppl):335–610.
Bengtsson B. The prevalence of glaucoma. Br J Ophthalmol. 1981;65:46–49. [CrossRef] [PubMed]
David R. Ocular hypertension. In: Cairns J, ed. Glaucoma. London: Grune & Stratton; 1986:551–567.
Katz J, Sommer A. Reliability indexes of automated perimetric tests. Arch Ophthalmol. 1988;106:1252–1254. [CrossRef] [PubMed]
Holland WW, Stewart S. Screening in Health Care. London: Nuffield Provincial Hospitals Trust; 1990.