In association with members of the North American Neuro-Ophthalmology Society and the Neuro-Ophthalmology Research Disease Investigator Consortium (NORDIC), the National Eye Institute (NEI) held a public workshop on Neuro-Ophthalmic Disease Clinical Trial Endpoints, focusing on optic neuropathies, on June 28, 2019. Participants included researchers, clinicians, clinician-scientists, and regulatory authorities, working together to discuss issues relevant to endpoints and outcomes for clinical trials of treatments for optic neuropathies. The workshop was organized by Leonard A. Levin, MD, PhD, Chair of Ophthalmology and Visual Sciences at McGill University; Mark Kupersmith, MD, Chair of NORDIC and Director of the Neuro-ophthalmology Services at New York Eye and Ear Infirmary and Mount Sinai Healthcare System; Neil R. Miller, MD, FACS, Co-Chair of NORDIC and Professor of Ophthalmology, Neurology, and Neurosurgery at Johns Hopkins; Laura J. Balcer, MD, MSCE, NORDIC Chair of Quality of Life Committee; and Roy W. Beck, MD, PhD, Executive Director of the Jaeb Center for Health Research.
The goal of the workshop was to bring together experts in neuro-ophthalmology, previous and current neuro-ophthalmic trials, visual structure and function measurements, outcomes research, quality-of-life (QOL) measures, and regulatory issues and—using both formal presentations and panel discussions—determine the optimum outcome measures for various types of optic neuropathies. The following summarizes the day's presentations, discussions, and recommendations.
Dr. Paul Sieving, Director of the National Eye Institute, welcomed the attendees and provided background on the rationale for the meeting. Dr. Levin then opened the scientific session, emphasizing that different optic neuropathies often cause different types of visual loss. For example, the major damage in glaucoma usually begins in the peripheral visual field (VF), with preservation of central visual acuity (VA) until late in the disease. Papilledema follows a similar course. In contrast, optic neuritis usually causes central vision loss that progresses rapidly over a week or two and then slowly improves over up to a year. Leber hereditary optic neuropathy (LHON), a maternally inherited disorder associated with mutations in mitochondrial DNA, affects central acuity like optic neuritis, but improvement is uncommon. Clinically meaningful visual endpoints must therefore be chosen specifically for the disease being studied.
Panel Discussion on Lessons Learned From Endpoints in Optic Neuropathy Treatment Trials
Dr. Paul Van Veldhuisen discussed letter threshold changes versus mean letter counts in best-corrected VA (BCVA) as an outcome measure for efficacy of treatments.
Examples of threshold outcomes for BCVA include ≤20/200 in the better eye (legal blindness) when the aim is to slow progression to blindness, ≥20/40 after cataract surgery, and a 15-letter improvement from baseline. The last example is used in many trials to meet regulatory requirements. This outcome is on a patient level (i.e., it must be easily interpretable to both patient and clinician). Statistics used for threshold outcomes include risk difference, relative risk, odds ratio, and time-to-event measures (Kaplan–Meier, hazard ratios).
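As an illustration of the threshold statistics listed above, the following minimal sketch computes risk difference, relative risk, and odds ratio from responder counts (e.g., patients gaining at least 15 letters). The class name, function name, and counts are hypothetical and are not taken from any trial discussed at the workshop.

```python
from dataclasses import dataclass


@dataclass
class ResponderCounts:
    """Responder counts for a two-arm trial (hypothetical container)."""
    responders_treated: int
    n_treated: int
    responders_control: int
    n_control: int


def threshold_statistics(counts: ResponderCounts) -> dict:
    """Return risk difference, relative risk, and odds ratio."""
    p_t = counts.responders_treated / counts.n_treated
    p_c = counts.responders_control / counts.n_control
    return {
        "risk_difference": p_t - p_c,
        "relative_risk": p_t / p_c,
        "odds_ratio": (p_t / (1 - p_t)) / (p_c / (1 - p_c)),
    }


# Made-up example: 30/100 responders on treatment, 15/100 on control.
example_stats = threshold_statistics(ResponderCounts(30, 100, 15, 100))
```

The sketch assumes that both arms have at least one responder and at least one nonresponder; otherwise the ratios are undefined.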
In contrast to threshold outcomes, continuous outcomes for BCVA take advantage of the full distribution of data, using mean changes from baseline, or median changes if the distribution is influenced by outliers. Because these outcomes are measured on a group level, they are less interpretable to a clinician or patient, which raises the question of whether it is better to look at mean changes or thresholds when VA is the primary outcome.
The disadvantages of a dichotomous outcome from BCVA are loss of information, misclassification, and both floor and ceiling effects. Loss of information translates into a loss of statistical power, requiring a larger sample size. Another issue when creating a binary outcome based on responder and nonresponder classification is misclassification caused by measurement errors and systematic biases, potentially resulting in both false positives and false negatives. The issue is magnified when there is more variability of data and when there are a lot of data close to the cut point.
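The misclassification point can be made concrete with a small sketch. It assumes, purely for illustration, that the measured change equals the true change plus normally distributed measurement error; the 15-letter cut point comes from the text, while the 5-letter error SD and the function names are hypothetical.

```python
from statistics import NormalDist


def prob_classified_responder(true_change: float,
                              cutoff: float = 15.0,
                              error_sd: float = 5.0) -> float:
    """P(measured change >= cutoff) under normal measurement error."""
    measured = NormalDist(mu=true_change, sigma=error_sd)
    return 1.0 - measured.cdf(cutoff)


def misclassification_probability(true_change: float,
                                  cutoff: float = 15.0,
                                  error_sd: float = 5.0) -> float:
    """Probability that the observed class differs from the true class."""
    p_responder = prob_classified_responder(true_change, cutoff, error_sd)
    truly_responder = true_change >= cutoff
    return 1.0 - p_responder if truly_responder else p_responder


# Patients whose true change lies near the cut point are misclassified
# far more often than patients far from it.
near_cut = misclassification_probability(14.0)
far_from_cut = misclassification_probability(25.0)
```

Under these assumptions, a patient whose true change sits exactly at the cut point is classified essentially by chance, which is why data concentrated near the cut point magnify the problem.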
Floor and ceiling effects also tend to occur with VA binary outcome measurements. Although strict eligibility on VA can mitigate floor and ceiling effects, this limits generalizability.
VA letter scores may be more appropriate as primary outcomes, but they have drawbacks. For example, a large trial may show a difference between treatment and control that is statistically significant yet too small to be clinically meaningful. Threshold outcomes should be considered secondary unless reaching a threshold is the primary objective of the study. When powered on a continuous measure, studies may not have sufficient sample size to detect differences for thresholds.
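The sample size point can be sketched with the standard normal-approximation formulas for a two-sided comparison of two means and of two proportions. The effect sizes and the standard deviation in the example are hypothetical, chosen only to show the calculation, and the function names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def _z_sum(alpha: float, power: float) -> float:
    """Sum of the critical values for two-sided alpha and the given power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_per_group_means(delta: float, sd: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients per group to detect a mean difference `delta` (common SD)."""
    return ceil(2 * (_z_sum(alpha, power) * sd / delta) ** 2)


def n_per_group_proportions(p_control: float, p_treated: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Patients per group to detect a difference in responder proportions."""
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil(_z_sum(alpha, power) ** 2 * variance
                / (p_treated - p_control) ** 2)


# Hypothetical inputs: 5-letter mean difference with SD 12 letters,
# versus 15% and 30% of patients reaching a 15-letter gain.
n_continuous = n_per_group_means(delta=5, sd=12)
n_threshold = n_per_group_proportions(0.15, 0.30)
```

With these made-up inputs, the threshold comparison needs more patients per group than the continuous one, so a trial sized for the continuous outcome would be underpowered for the threshold outcome. The direction and size of the gap depend entirely on the assumed inputs.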
Dr. Cynthia Owsley said that patients evaluate treatment success not by the number of letters read on a chart but on how well they can engage in visual activities of daily life. Two ways to assess this are visual task performance and patient-reported outcome questionnaires. Dr. Owsley discussed visual task performance measures, including driving, reading, mazes, visual processing speed, and physical activity, as endpoints in observational studies or clinical trials.
A range of driving simulators are available, such as PC-based simulators with a steering wheel and gas pedal, a cab from a real vehicle placed in front of central and peripheral screens, and virtual reality devices with moving bases, vibration, and proprioceptive feedback. Dependent measures include lane boundary crossings, average speed, pedestrian detection, obstacle detection, obeying traffic control devices, and the impact of secondary tasks (e.g., texting).
The FDA allows the use of driving simulation to evaluate drug safety (e.g., to see if the effects of sleeping medications persist in the morning and to differentiate sedating from nonsedating antihistamines). The FDA has not, however, used driving simulation to establish treatment efficacy for vision.
The most commonly used reading task in observational studies and clinical trials is the MNRead Acuity Chart. This chart requires the subject to read a series of sentences written at a third-grade level and provides measures of reading acuity, maximum reading speed, and critical print size. The International Reading Speed Texts (IReST) test requires the subject to read a paragraph written at a sixth-grade level and measures average reading speed. The rationale behind reading a paragraph rather than single sentences is that it requires more sustained reading, which vision-impaired people find more difficult than reading an individual sentence.
Dr. Jean Bennett's group developed a multiluminance mobility maze test for phase III RPE-65 gene therapy trials for Leber congenital amaurosis. It establishes validity, reliability, repeatability, and relationship to vision. A change of at least two light levels was considered a clinically meaningful change. The group worked closely with the FDA to meet endpoint criteria and establish efficacy of the intervention. Another maze test, the Pedestrian Accessibility and Mobility Laboratory, was developed by a group at University College London for their 2008 Leber congenital amaurosis gene therapy trial.
The investigators reported that the therapy resulted in “significant improvement in subjective test of visual mobility.”
Another outcome measure, visual processing speed, is the amount of time (in milliseconds) required to make a correct judgment about a visual stimulus and involves higher-order visual processing. Processing speed is probed under task demands such as divided attention or distraction. Visual processing speed measures have been associated with health and well-being. For example, poor visual processing speed is correlated with higher collision rates, driving performance problems, performance mobility problems, reduced physical activity, and increased time to perform instrumental activities of daily living, such as finding an object in a room.
Dr. Wiley Chambers discussed regulatory perspectives of the FDA on functional outcomes. Premarket review of drugs and devices occurs under the Food, Drug and Cosmetic Act and for biologics under the Public Health Service Act. The mission of the Center for Drug Evaluation and Research under the FDA is to ensure that safe and effective drugs are available to the American population. That goal is accomplished by monitoring drug development processes during the investigational stages (confidential), approving new drugs that have been proven safe and efficacious (confidential until approval and then designed to be transparent), and monitoring adverse effects after approval.
There are risks associated with all drug products. Assessment of a drug's risks improves as more individuals receive it. Products are approved based on an assessment of risks and benefits of the product when taken as labeled by the intended population. New drug applications require adequate and well-controlled studies to establish safety and efficacy. Isolated case reports, random experience, reports lacking details, and uncontrolled or partially controlled studies are not acceptable as the sole basis for approval of a product.
An adequate and well-controlled trial is one that has a clear statement of objectives; a study design that permits valid comparison; a subject selection method that provides adequate assurance that the subject group has or will develop the condition; minimum bias in assigning subjects to a group; minimum bias on the part of subjects, observers, and analysts; well-defined and reliable method(s) of assessment; and adequate result analysis to assess the effects of the drug.
There is a strong desire to approve products based on how the product helps subjects. Subjective endpoints are patient-reported outcomes that address single- or multiple-domain questions. Single-domain questions may inquire about itching, pain, ocular irritation, and ocular dryness. Usually, a single-domain question presents a 5-point scale of response and expects at least a 1-point mean change. Multiple-domain questions may include QOL measures. These questions are specific to the intended population and require knowledge of how much weight to give to each QOL domain. No ophthalmology QOL measures are currently validated for drug-evaluation research.
Functional endpoints for ophthalmology drug trials frequently measure visual function, which includes but is not limited to high- and low-contrast VA (doubling of visual angle on ETDRS chart); VF, in which a 7-dB change is usually expected over a predefined area of at least 5 points; contrast sensitivity, which uses doubling of visual angle; and activities of daily living, such as the ability to perform tasks in a low-light setting (endpoint is light level). Other functional endpoints acceptable to the FDA but associated with a high level of variability are reading speed, driving performance, and color discrimination.
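For readers who want to relate "doubling of visual angle" to letter counts, the following sketch converts between Snellen fractions, logMAR, and ETDRS letters. It relies on the standard ETDRS chart layout (five letters per 0.1 logMAR line, so 0.02 logMAR per letter); the function names are illustrative.

```python
from math import log10

# Standard ETDRS layout: 5 letters per line, 0.1 logMAR per line.
LOGMAR_PER_LETTER = 0.02


def snellen_to_logmar(numerator: float, denominator: float) -> float:
    """Convert a Snellen fraction such as 20/40 to logMAR."""
    return log10(denominator / numerator)


def letters_for_angle_ratio(ratio: float) -> float:
    """ETDRS letters corresponding to a change in visual angle by `ratio`."""
    return log10(ratio) / LOGMAR_PER_LETTER


# Doubling of the visual angle is log10(2), about 0.30 logMAR, which is
# roughly 15 letters (3 lines) on an ETDRS chart.
letters_for_doubling = letters_for_angle_ratio(2)
```

This is why a 15-letter change and a doubling of the visual angle are treated as essentially the same criterion.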
While many objective measures have been used to approve new drug products, the ability to measure a difference in these endpoints does not necessarily make them clinically relevant. The following are objective measures that have been used as functional endpoints: intraocular pressure (5- to 7-mm Hg reduction), refractive power (50% slowing of progressive change), pupil size (maintenance of 6-mm diameter under bright light), and tear production (increase by 10 mm by Schirmer score).
Anatomic measures must predict a clinical benefit for patients (e.g., prevention of progression of cytomegalovirus retinitis, diabetic retinopathy, retinal detachment, or photoreceptor loss; resolution of anterior chamber cell and flare or conjunctival redness; and reepithelialization of a previously infected cornea). These measures alone do not require any visual performance measures to show improvement.
Q. How are criteria on doubling of visual angle determined?
A. Doubling of the visual angle was defined in the Early Treatment Diabetic Retinopathy Study as the minimum change considered to be clinically significant.
Q. What FDA-guided qualification measures are used to determine an outcome?
A. There is an FDA guidance document specifying procedures for developing a patient-reported outcome that everyone is encouraged to follow. In that way, data can be generated that support the patient-reported outcome.
Q. Is there an issue with learning effects and how is that handled?
A. VA, VF, etc. are affected by learning effects, but there are methods to minimize them. Low-contrast VA doesn't have a large practice effect compared with some of the other measures. In the walk-in-maze test, there are built-in measures to minimize practice effects.
Q. My question is about the variability or reliability of visual field where there's a new patient or a patient you've been following. There may be some false-positive measurements. How do you determine the true effects considering these factors and including training effect? How many times do you take readings for a single test and which reading is set as the best reading?
A. There is no substitute for multiple tests. In addition, it is important to have a good technician in the room with the patient during the test. He/she can observe the patient's responses and restart the test if necessary.
Q. When designing QOL domains, is it possible to design a tool that asks patients how much that domain means to them? Is there a way to individualize it?
A. The guidance document that I talked about earlier is designed to create a more targeted patient-reported outcome. This can be achieved by picking a population that is best suited for a particular test. Every test is meant for a specific population and the results do not give an accurate interpretation if the patient population doesn't match with the test.
Q. How do you address the variability of terms used by individual patients when assessing a large number of patients?
A. Instead of asking specifically about, for example, itching or dryness in the eye, patients can be asked what is bothering them the most. In follow-up, patient-reported changes throughout the trial duration can be followed.
Q. How does completing a maze describe an effect on visual function?
A. The goal of the maze testing used in a gene therapy trial was to demonstrate the patient's improved ability to see in low-light conditions. The maze was a modification of an activity of daily living and the test evaluated whether the task could be performed at a particular luminance level. After treatment, the test measured the difference in luminance in which the task could also be performed.
Dr. Maureen Maguire discussed statistical issues related to rare diseases. According to the Orphan Drug Act, “rare” is a prevalence of <200,000 in the United States or <0.06% of the population. These are serious diseases, usually with no effective treatment, and although each disease is rare individually, collectively there are many of them. Some study designs and approaches work around difficulties posed by small sample sizes.
She discussed strategies for trial design in rare diseases. To ensure that a given percentage reduction in progression by treatment results in a larger difference in outcome, one could select patients who progress more quickly, select the outcome measures that progress the fastest, or follow patients for longer, so that there is more time to accumulate the difference between the groups. Alternatively, decreasing the variability within treatment and control groups would minimize error. This can be done by enrolling patients who are more homogeneous in progression rates (e.g., VF loss at baseline), by choosing measures that have less day-to-day variation in response (RNFL versus VF versus microperimetry), by increasing the number of measurements for each patient and analyzing repeat measurements as a cluster, and by decreasing testing variability by standardizing the way the test is done and graded.
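The benefit of repeating measurements can be sketched with a simple variance model. It assumes, for illustration only, that each reading is the patient's true value plus independent test-retest noise, so the variance of a patient's mean over k readings is the between-patient variance plus the within-patient variance divided by k. The SD values and function names are hypothetical.

```python
def variance_of_patient_mean(between_sd: float, within_sd: float,
                             n_repeats: int) -> float:
    """Variance of a patient's averaged outcome over `n_repeats` tests."""
    if n_repeats < 1:
        raise ValueError("n_repeats must be at least 1")
    return between_sd ** 2 + within_sd ** 2 / n_repeats


def relative_sample_size(between_sd: float, within_sd: float,
                         n_repeats: int) -> float:
    """Required sample size relative to a single test per patient.

    Sample size scales with outcome variance for a fixed effect size.
    """
    return (variance_of_patient_mean(between_sd, within_sd, n_repeats)
            / variance_of_patient_mean(between_sd, within_sd, 1))


# Hypothetical SDs: between-patient 1.0, test-retest 2.0 (same units).
ratio_with_four_tests = relative_sample_size(1.0, 2.0, 4)
```

The gain is largest when test-retest noise dominates, and it levels off as the between-patient variance becomes the limiting term.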
Another approach to combatting the very small sample size problem is the use of historical controls (e.g., from natural history studies of the disease). One major concern is that the expected course of the treated patients is different from the course of the historical controls. There may be patient selection and informed consent factors for the historical controls. Moreover, there may be temporal changes and other contemporary factors affecting outcome in the treatment group that do not have an equivalent in the historical control group. FDA guidance on rare diseases states that historical controls may be considered when there is an unmet medical need; a well-documented, highly predictive disease course that can be measured objectively; and the expected treatment effect is large, self-evident, and temporally closely associated with treatment.
In the crossover design, a patient receives treatments in a randomized manner. A washout period is included between treatments so that the effect of the first has ended before the second one is started. Advantages of crossover studies are that each person serves as their own control, which reduces variability; a smaller sample size is needed compared with traditional parallel groups; and all patients receive the treatment being tested, which is more acceptable to study participants. Disadvantages are that the approach applies only to chronic, stable, and incurable conditions, such as pain or seizures; it is appropriate only for treatments that do not produce chronic effects; the washout period must be known; response must occur within the treatment window; and carryover and period effects can complicate interpretation.
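The sample size advantage of a crossover design can be illustrated with a textbook approximation. The sketch assumes a two-treatment, two-period crossover with no carryover or period effects, equal outcome variance under both treatments, and a known within-patient correlation; under those assumptions the total number of patients is the parallel-group per-arm size multiplied by one minus the correlation. The function name and inputs are hypothetical.

```python
from math import ceil


def crossover_total_n(parallel_n_per_group: int,
                      within_patient_corr: float) -> int:
    """Approximate total patients for a 2x2 crossover with equal power.

    Parallel design: Var(effect) = 2 * sigma^2 / n_per_group.
    Crossover:       Var(effect) = 2 * sigma^2 * (1 - rho) / N_total.
    Equating the two gives N_total = n_per_group * (1 - rho).
    """
    if not 0.0 <= within_patient_corr < 1.0:
        raise ValueError("correlation must be in [0, 1)")
    return ceil(parallel_n_per_group * (1.0 - within_patient_corr))


# Hypothetical: a parallel trial needing 64 per group (128 in total).
n_crossover = crossover_total_n(64, within_patient_corr=0.5)
```

Even with zero correlation the crossover needs half the total patients, because every patient contributes to both arms; carryover or period effects, which the text lists as disadvantages, would erode this advantage.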
In N-of-1 studies, treatment is given to patients in a randomized order over several periods. Again, a drug with a rapid response and effect loss is required. With longer periods of study, an N-of-1 study can help reach conclusions about the tested drug in specific patients. It is possible to combine the results from different patients using meta-analysis techniques. This type of study offers generalizability with a small number of patients and may be the most useful when treatment effect sizes are expected to differ across patients. N-of-1 studies have similar advantages and disadvantages to crossover studies.
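One common way to combine per-patient results is inverse-variance pooling. The sketch below uses a fixed-effect model for brevity, which assumes a common treatment effect; when effects are expected to differ across patients, as the text notes, a random-effects model would be more appropriate. The function name and numbers are hypothetical.

```python
from math import sqrt


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its SE."""
    if len(estimates) != len(standard_errors) or not estimates:
        raise ValueError("need equal-length, non-empty inputs")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    return pooled, sqrt(1.0 / total_weight)


# Made-up per-patient treatment effects and standard errors
# from three N-of-1 studies.
pooled_effect, pooled_se = pool_fixed_effect([2.0, 4.0, 3.0],
                                             [1.0, 1.0, 2.0])
```

More precise per-patient estimates receive more weight, and the pooled standard error is smaller than that of any single patient.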
In a randomized placebo-phase design, patients are randomized to receive treatment or placebo, as in traditional studies. However, the placebo group gets the experimental treatment after a certain amount of time. The assumption is that, if effective, treatment produces a lasting effect, and patients treated early will respond sooner. The advantage is that all patients get the new treatment, but a disadvantage is difficulty in determining how long the placebo phase should be.
In randomized withdrawal design, all patients receive treatment; those who apparently “respond” are randomized to continuation or withdrawal of treatment (placebo given) and followed. Patients who apparently do not respond are removed from the trial. An advantage of randomized withdrawal design is that it enriches the patient population in the trial with those most likely to respond to the treatment. Disadvantages are that it is difficult to fix the duration of the initial treatment phase, it is only applicable to treatments with no or few lasting effects, and it offers limited generalizability to the general patient population.
Adaptive designs are prospective plans to use data collected during the trial to change aspects of the study design, which usually aim to reduce the study sample size. Adaptive designs require interim monitoring that may lead to a change in randomization ratio based on observed results, change in the randomization design to increase balance on key covariates among treatment groups, or recalculation of sample size or follow-up time based on degree of variability.
In response adaptive randomization, a type of adaptive design, the goal is to maximize the number of patients assigned to the more effective treatment while minimizing the overall N. Responses from previously assigned patients are used to adjust the allocation ratio of treatments toward a higher probability of the more effective treatment. Limitations are that it requires a very quick response, can effectively unmask investigators to the next assignment, and disrupts balance over time, which means that it is vulnerable to temporal drift.
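One classical example of response adaptive randomization is the randomized play-the-winner urn, sketched below. It is shown only as an illustration and is not necessarily the rule discussed at the workshop. The class name, arm labels, and parameters are hypothetical.

```python
import random


class PlayTheWinnerUrn:
    """Randomized play-the-winner urn for two arms, "A" and "B"."""

    def __init__(self, initial_balls: int = 1, balls_added: int = 1,
                 seed=None):
        self.balls = {"A": initial_balls, "B": initial_balls}
        self.balls_added = balls_added
        self._rng = random.Random(seed)

    def probability(self, arm: str) -> float:
        """Current probability that the next patient is assigned `arm`."""
        return self.balls[arm] / sum(self.balls.values())

    def assign(self) -> str:
        """Draw (with replacement) to assign the next patient."""
        return "A" if self._rng.random() < self.probability("A") else "B"

    def record(self, arm: str, success: bool) -> None:
        """Update the urn once the patient's response is known.

        A success adds balls for the same arm; a failure adds balls
        for the opposite arm.
        """
        other = "B" if arm == "A" else "A"
        rewarded = arm if success else other
        self.balls[rewarded] += self.balls_added


urn = PlayTheWinnerUrn(seed=0)
first_arm = urn.assign()
urn.record(first_arm, success=True)  # allocation now favors first_arm
```

Because each assignment depends on earlier responses, the rule only works when responses arrive quickly, and the drifting allocation probabilities are what make the design sensitive to temporal trends.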
Pediatric Eye Disease Investigator Group and Diabetic Retinopathy Clinical Research Network
Dr. Michael Repka discussed the Pediatric Eye Disease Investigator Group (PEDIG), which was funded by the NEI in 1997 to conduct a congenital esotropia observational study and an amblyopia treatment study. Subsequently, NEI funded PEDIG as a network, and since then, PEDIG has conducted 47 studies, 10 of which are ongoing. PEDIG includes both community and academic sites in the United States, Canadian provinces, Mexico (although not currently), and the United Kingdom.
PEDIG has its own institutional review board and coordinating center. Its sites include private practice ophthalmologists, ophthalmologists in academic medical centers, optometrists at optometry schools, and a few private-practice optometrists. The endpoints adopted by PEDIG include high- and low-contrast VA, angle of strabismus, control of strabismus, tear film, RNFL thickness using OCT, refractive error, and patient-reported outcomes. To measure visual acuity, EVA, mean of absolute VA outcome (used in randomized amblyopia studies), lines or letter of improvement, threshold acuity, and comparison with normal have been used. QOL measures include EyeQ, PedsQL, and disease- or treatment-specific questionnaires.
Dr. Repka then discussed the Diabetic Retinopathy Clinical Research (DRCR) network. This network's mission is to improve the lives of individuals with retinal pathology by performing high-quality, collaborative, clinical research that leads to a better understanding of retinal diseases and advances their treatment. Principal importance is placed on clinical trials, but epidemiologic outcomes and other research also may be supported. The DRCR uses EVA and macular OCT thickness as endpoints.
This all-day meeting covered a large number of topics in depth. The following summarizes some of the main conclusions relevant to carrying out clinical trials related to various optic neuropathies.
Papilledema (e.g., idiopathic intracranial hypertension): Because visual acuity and color vision are not affected until severe damage has occurred, mean deviation or other measures of progression on automated perimetry are optimal outcome measures for monitoring the effects of papilledema on the visual system.
Leber hereditary optic neuropathy: Because of the profound loss of central vision in this condition, assessments of the central visual field are not helpful in determining treatment effects. Although the use of electrophysiologic measures can be considered, optimal primary outcome measures have yet to be established.
Dominant optic atrophy: The Low Vision Cambridge Color Test can be used in patients with visual acuity >20/800 and has been found to be useful in assessing color discrimination.
Optic neuritis: Given that the visual acuity in patients with optic neuritis improves spontaneously over time, clinical measurements of visual function such as central acuity, color vision, or visual field can be less helpful. Electrophysiologic testing—specifically, the P100 latency of the visual evoked potential—may provide evidence for remyelination, and optical coherence tomography of the retinal ganglion cell/inner plexiform layer and peripapillary retinal nerve fiber layer can serve as secondary outcome measures.
Glaucoma: The results of automated perimetry, with the timing of visual field examinations clustered in order to increase the ability to detect change, can be used as primary outcome measures. Measurements of contrast sensitivity, color vision, and visual acuity can be used but may be less sensitive or specific. There is no consensus on structural endpoints, in part because the degree of correlation with clinically meaningful functional changes is not yet sufficient.
Clinical trial issues applicable to multiple optic neuropathies: A variety of trial designs can be used for rare optic neuropathies, in which the number of participants is likely to be small. Similarly, diseases in which there is severe visual loss can benefit from outcome measures that include patient quality of life. A variety of structural measures continue to be developed, which may eventually serve as primary outcomes once there is evidence for sufficient strength of the association with a clinically meaningful outcome.
Speaker Affiliations and Commercial Relationships:
Laura Balcer, New York University Grossman School of Medicine, New York, NY, USA, None;
Valerie Biousse, Emory University School of Medicine, Atlanta, GA, USA, GenSight (C), Neurophoenix (C);
Diego Cadavid, University of Massachusetts, Worcester, MA, USA, X4 Pharmaceuticals (E);
Wiley Chambers, US Food and Drug Administration, Silver Spring, MD, USA, None;
Catherine Cukras, National Eye Institute, NIH, Bethesda, MD, USA, None;
C. Gustavo De Moraes, Columbia University, New York, NY, USA, National Eye Institute/NIH (F), Centers for Disease Control and Prevention (F), Carl Zeiss (R, C, F), Topcon (F), Heidelberg Engineering (F), Novartis (F, C), Reichert (C, F), Galimedix (C), Belite (C);
Kay Dickersin, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA, None;
Brett G. Jeffrey, National Eye Institute, Bethesda, MD, USA, None;
Randy H. Kardon, University of Iowa, Iowa City, IA, USA, Heidelberg Engineering (F), Santen (R), Novartis (C), FaceX LLC (I), MedFace LLC (I), National Eye Institute/NIH (F), Department of Veterans Affairs (RR&D Division) (F);
Brad J. Kolls, Duke University School of Medicine, Durham, NC, USA, Corticare (C), Reneuron (F), research on stroke systems of care and stroke recovery (F);
Mark J. Kupersmith, Icahn School of Medicine at Mount Sinai, New York, NY, USA, National Eye Institute/NIH (F), Regenera (C), Palestroni Foundation (F), New York Eye and Ear Infirmary Foundation (F);
Leonard A. Levin, McGill University, Montreal, Canada, Canada Institutes for Health Research (F), Aerie (C), Eyevensys (C), Galimedix (C), Genentech (C), Perfuse (C), Quark (C), Regenera (C), Santen (C), Wisconsin Alumni Research Foundation (P);
Jeffrey M. Liebmann, Columbia University Irving Medical Center, New York, NY, USA, Aerie (C), Allergan (C), Carl Zeiss Meditec (F), Heidelberg Engineering (F), Genentech (C), Thea (C), Novartis (R);
Nicholas LaRocca, National Multiple Sclerosis Society, New York, NY, USA, None;
Maureen Maguire, University of Pennsylvania, Philadelphia, PA, USA, Genentech/Roche (C), Regenera (C), Foundation Fighting Blindness (F);
Juliette E. McGregor, University of Rochester, Rochester, NY, USA, National Eye Institute/NIH (F);
Neil R. Miller, Johns Hopkins University School of Medicine, Baltimore, MD, USA, National Eye Institute/NIH (F), Invex Therapeutics (C);
Cynthia Owsley, University of Alabama at Birmingham, Birmingham, AL, USA, National Eye Institute/NIH (F), National Institute on Aging (F), Centers for Disease Control and Prevention (F), Research to Prevent Blindness (F), Greater Baltimore Medical Center Educational Foundation Inc. (F);
Michael X. Repka, Johns Hopkins University, Baltimore, MD, USA, National Eye Institute/NIH (F), Alcon (C, F), Luminopia (C), American Academy of Ophthalmology (F);
Paul A. Sieving, National Eye Institute, NIH, Bethesda, MD, USA, NIH Intramural Program (F);
Paul C. VanVeldhuisen, The Emmes Company, LLC, Rockville, MD, USA, National Eye Institute/NIH (F);
Michael Wall, University of Iowa, Carver College of Medicine, Iowa City, IA, USA, None.
Disclosure: L.A. Levin, See Commercial Relationships above; M. Sengupta, None; L.J. Balcer, None; M.J. Kupersmith, See Commercial Relationships above; N.R. Miller, See Commercial Relationships above