Open Access
Perspective  |   August 2024
Clinical Evaluation of Artificial Intelligence-Enabled Interventions
Author Affiliations & Notes
  • H. D. Jeffry Hogg
    University Hospitals Birmingham NHS Foundation Trust, Birmingham, United Kingdom
    Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, United Kingdom
    NIHR-Supported Incubator in AI & Digital Healthcare, Birmingham, United Kingdom
  • Alexander P. L. Martindale
    Brighton and Sussex Medical School, Brighton, United Kingdom
  • Xiaoxuan Liu
    University Hospitals Birmingham NHS Foundation Trust, Birmingham, United Kingdom
    Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, United Kingdom
    NIHR-Supported Incubator in AI & Digital Healthcare, Birmingham, United Kingdom
    National Institute for Health and Care Research (NIHR) Birmingham Biomedical Research Centre, United Kingdom
  • Alastair K. Denniston
    University Hospitals Birmingham NHS Foundation Trust, Birmingham, United Kingdom
    Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, United Kingdom
    NIHR-Supported Incubator in AI & Digital Healthcare, Birmingham, United Kingdom
    National Institute for Health and Care Research (NIHR) Birmingham Biomedical Research Centre, United Kingdom
  • Correspondence: Alastair K. Denniston, Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Edgbaston, Birmingham B15 2TT, UK; [email protected]
Investigative Ophthalmology & Visual Science. 2024;65(10):10. https://doi.org/10.1167/iovs.65.10.10
Abstract

Artificial intelligence (AI) health technologies are increasingly available for use in real-world care. This emerging opportunity is accompanied by a need for decision makers and practitioners across healthcare systems to evaluate the safety and effectiveness of these interventions against the needs of their own setting. To meet this need, high-quality evidence regarding AI-enabled interventions must be made available, and decision makers in varying roles and settings must be empowered to evaluate that evidence within the context in which they work. This article summarizes good practices across four stages of evidence generation for AI health technologies: study design, study conduct, study reporting, and study appraisal.

Clinical studies are how researchers provide evidence for healthcare decision makers to evaluate and prioritize interventions. These decision makers include regulators, payers, and other healthcare leaders; all have different scopes of influence and interest, but each has impactful decisions to make.1 The quality of these decisions depends on four critical stages of evidence generation and evaluation: study design, study conduct, study reporting, and study appraisal. A limitation at just one of these four stages compromises the evidence base available for intervention evaluation (Fig. 1). End users of an intervention also benefit from understanding this process. An understanding of the evidence underpinning an intervention's scope and limitations helps users develop appropriately calibrated trust and supports safe and effective use.
Figure 1.
 
A schematic of the process of intervention evaluation through four stages of research design, conduct, reporting, and appraisal. The design of clinical research establishes the maximum potential value of evidence that can be offered, with each subsequent step acting as sequential and modifiable filters on the value ultimately delivered to healthcare practice. In the scenario illustrated in the figure, it is the reporting of the research that limits the evidence it generates to inform practice, but it could be any combination of the four stages.
The surge of interest in artificial intelligence (AI) and a desire to accelerate its perceived benefits to struggling healthcare systems have put pressure on our traditional models of evidence generation and evaluation. Although the next decade is likely to see a range of different approaches to this, the principles of evidence generation and evaluation remain. The information we have on the performance and safety of an intervention depends on the quality of the studies undertaken (their design, conduct, and reporting). The quality of our decision relating to that intervention then depends on our ability to appraise those studies and on our wider understanding of the sociotechnical context into which the intervention will be deployed.
In this article, we focus on AI-enabled interventions using machine learning (ML). This focus was chosen because ML technologies (particularly the deep-learning subgroup of ML technologies) are responsible for the sustained interest and investment in AI-enabled interventions and for the additional considerations recommended for their evaluation.2–8 These considerations stem from characteristics of AI that may challenge regulators and other evaluators. They include (1) the capacity of AI to “learn” associations in new data (which may be spurious), (2) low explainability of how an AI model arrived at its outputs, (3) high dependency of AI performance on training context (such as population and setting), and (4) assignment of AI to use cases with a relatively high degree of autonomy and clinical risk (such as diagnosis).9 Ideally, researchers, decision makers, and end users would feel confident in accommodating these considerations in an efficient and effective evidence generation and appraisal process for AI-enabled interventions. In reality, maintaining oversight of best practices in the evolving landscape of AI-enabled intervention evaluation can be challenging. This limits key stakeholders’ understanding and, ultimately, the rate and scale of responsible AI innovation in health care.
This commentary seeks to address this challenge by summarizing contemporary guidance and best practices on evaluating AI health technologies. It draws together resources for researchers, decision makers, and potential end users to support their different roles in the generation and clinical evaluation of evidence for AI-enabled interventions. 
Designing and Conducting Clinical Studies
Studies of AI health technologies should aim to generate evidence that justifies and directs investment in the next step of an intervention along the translational pathway. The selection of an appropriate study design therefore depends upon how far along that pathway a specific AI technology has progressed. This is a large part of the reason why the rapidly growing body of clinical studies of AI is composed largely of preclinical studies and small-scale prospective clinical studies (equivalent to early-phase clinical trials).10 However, whether considering a pharmacotherapy, a physical medical device, or an AI technology, larger scale interventional clinical trials should ultimately inform decisions to implement healthcare interventions in real-world care. Estimating real-world effectiveness from preclinical performance is often challenging but is particularly difficult with AI-enabled interventions. They are complex interventions that can have unpredictable impacts on a healthcare pathway, such as underperformance for a specific subpopulation.11,12 Interventional studies provide evidence beyond simple technology performance (e.g., diagnostic accuracy) and can provide vital information about the actual effect on the patient and other downstream consequences. The randomized controlled trial (RCT) design is held up as the benchmark for evidence generation because random allocation tackles the major bias arising from systematic differences between intervention and control groups. To maximize their value, RCTs should be designed to reflect the intended real-world application as closely as possible.
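To illustrate why random allocation matters, the following is a minimal, purely illustrative Python simulation (all numbers hypothetical): an unmeasured severity factor drives the outcome, but because allocation is random it balances across arms, so the crude risk difference approximates the true treatment effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000

# Hypothetical cohort: an unmeasured severity factor influences the outcome.
severity = rng.normal(size=n)

# 1:1 random allocation to AI-supported care (1) vs. usual care (0).
arm = rng.permutation(np.repeat([0, 1], n // 2))

# Simulated binary outcome: risk is driven by severity, with a true benefit
# of roughly 5 percentage points in the AI arm.
p = 1 / (1 + np.exp(-(-1.0 + 0.8 * severity - 0.3 * arm)))
outcome = rng.binomial(1, p)

# Randomization balances the unmeasured confounder across arms...
print("mean severity (control, AI):",
      round(severity[arm == 0].mean(), 3), round(severity[arm == 1].mean(), 3))

# ...so the crude risk difference approximates the true effect without bias.
rd = outcome[arm == 1].mean() - outcome[arm == 0].mean()
print(f"estimated risk difference: {rd:.3f}")
```

Had allocation instead followed severity (e.g., sicker patients preferentially given the AI-supported pathway), the same crude comparison would be confounded; randomization is what licenses the simple between-arm contrast.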
In contrast to the exponential growth of “early-phase” clinical studies of AI health interventions, the number of larger scale clinical trials of AI health technologies (phase 3 equivalents) remains small.10,13 This scarcity of published late-stage study designs is not purely a rational reflection of the translational stage of AI technologies, as these trials number far fewer than the AI-enabled medical devices that have been granted regulatory approval for clinical use.13,14 Two factors may account for this. First, there may be evidence available to regulators that is not in the public domain. Publication is not an obligation for AI manufacturers making submissions to regulators; it also requires significant resource allocation and may even risk their intellectual property. Such failures to publicly report studies are nonetheless unhelpful, and manufacturers should be encouraged to share results openly to support better evaluation decisions across healthcare systems. Second, it appears that many regulatory approvals for AI-enabled interventions are based on non-interventional studies alone.15 This may benefit AI technology manufacturers, who avoid the costs associated with large-scale interventional trials, but it does not support decision makers evaluating AI-enabled interventions for patient and service benefit. It is striking that, despite the huge interest in AI health technologies, a systematic review looking for prospective RCTs of AI health technologies in any clinical setting identified just 65 eligible publications since September 2020.13 The largest share of these studies (n = 24) took place in China, with Europe (n = 14), the United States (n = 12), and Japan (n = 5) being the other major contributors. Categorized clinically, the largest contributors were gastroenterology (n = 15) and radiology (n = 5), with primary care, emergency medicine, diabetology, and cardiology each contributing four eligible RCTs. Despite their scarcity, the systematic review indicated good overall quality of study design across these RCTs (Table).
Table.
 
Key Considerations in Clinical Study Design for AI-Enabled Interventions and Potential Design Considerations
Although well-designed, large-scale RCTs are a valuable source of evidence in the evaluation of an AI health technology, it is important to understand their limitations. To complement the quantitative evidence available from RCTs, researchers should also design qualitative research studies that use stakeholder perspectives and experiences to generate evidence regarding the sociotechnical mechanisms by which an AI health technology influences outcomes in a specific healthcare context.16 Many such studies are also preclinical in nature; just 20 studies of stakeholder perspectives of AI-enabled interventions in prospective clinical use were identified by a recent bibliometric study.17 Ideally, qualitative studies would complement quantitative research methods to help support improvements in intervention design and implementation within various healthcare contexts.18 These facets concerning the mechanism by which complex healthcare interventions exert their impact on the wider health system are addressed elsewhere.19 
Reporting Clinical Studies
The design and conduct of a clinical study establish the potential evidence that it can offer to decision makers and potential end users of an AI technology (Fig. 1). This potential can only be fully realized through researchers’ commitment to transparent and complete reporting by (1) highlighting potential limitations of design and delivery that could introduce bias and (2) allowing other researchers and practitioners to test the reproducibility and replicability of the study.20 This is a key step in validating research findings in science generally, but it is a particular issue in the context of AI-enabled interventions, where performance of a model is vulnerable to shifts in clinical, technical, and operational contexts.11 Complete and transparent reporting by researchers also maximizes the amount of information that is available to decision makers and end users regarding the likely strengths and limitations of the intervention. 
Academic journal editors and peer reviewers act as gatekeepers to the reporting and dissemination of clinical evidence. They exercise that duty based on their quality assessments of clinical studies and their reporting. To support those assessments, reporting guidelines provide explicit and accessible standards of best practice regarding what should be reported and how it should be reported. The accessibility of these reporting guidelines also benefits researchers, who can use them as an explicit guide to what the wider research community will expect them to report. The standard-bearer for reporting guidelines is the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) Network, which has supported the development of guidelines for a wide range of study types and contexts, most notably the Consolidated Standards of Reporting Trials (CONSORT) guidelines and, more recently, a series of extensions for application to AI health technologies (Fig. 2).2–8,21,22 In addition, there are several independent guidelines that are specifically focused on AI technology. There is, reassuringly, a great deal of overlap in the requirements listed by these different AI-specific reporting guidelines.23
Figure 2.
 
Schematic of evidence generation across the translational pathway and relevant research reporting guidelines for AI-enabled interventions. Reporting guidelines specific to AI are shaded black, with other relevant guidelines shaded gray. The figure is only indicative of the primary application of the guidelines; a number of them may contain elements that are also of value in other contexts. *Guidelines with AI-specific expansions in development.
EQUATOR reporting guidelines are organized according to study type. Additional guidelines (or extensions to guidelines) address issues that may not be adequately covered by the core family of guidelines, notably for specific interventions (e.g., psychological interventions), outcomes (e.g., patient-reported outcomes), and newer study designs (e.g., n-of-1 trials).
Reporting Guidelines for the Early Developmental Stages of AI Health Technologies
The majority of AI health technology research is preclinical, producing evidence that cannot account for the various factors that influence real-world performance. Nevertheless, these studies may represent the only available evidence and provide important insights into the provenance of an AI health technology. These insights into the architecture of models and their training data are important and may not be as clear in later studies. Minimum Information about Clinical Artificial Intelligence Modeling (MI-CLAIM) is a reporting guideline that supports high-quality evidence production at this earlier translational stage.2 Of particular note is part 6 of the MI-CLAIM checklist, which aims to support the reproducibility of research by promoting the reporting of sufficient end-to-end technical detail (including relevant data, code, and dependencies).2 In the development of AI health technologies, such reporting is often challenged by researchers’ concerns over the intellectual property associated with their work and by the ethical and governance safeguards regarding data sharing.
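As a concrete (and deliberately minimal) sketch of the kind of end-to-end technical detail MI-CLAIM part 6 promotes, the Python below records a fixed random seed, the software environment, and a cryptographic fingerprint of the training data. The model name, version, and file paths are hypothetical placeholders, not items from the checklist itself.

```python
import hashlib
import json
import platform
import random
import sys

import numpy as np

SEED = 2024  # fixed seed so stochastic training/evaluation steps are repeatable
random.seed(SEED)
np.random.seed(SEED)

def sha256_of_file(path: str) -> str:
    """Fingerprint a data file so others can verify they hold identical data."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Stand-in dataset so the sketch is self-contained; in practice this would be
# the (possibly access-controlled) training data itself.
with open("training_data.csv", "w") as f:
    f.write("id,label\n1,0\n2,1\n")

manifest = {
    "model_name": "example-dr-classifier",  # hypothetical identifier
    "model_version": "1.2.0",               # hypothetical version
    "random_seed": SEED,
    "python_version": sys.version.split()[0],
    "platform": platform.platform(),
    "numpy_version": np.__version__,
    "training_data_sha256": sha256_of_file("training_data.csv"),
}

with open("mi_claim_part6_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```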
Reporting Guidelines for Diagnostic and Prognostic Accuracy Studies of AI Health Technologies
At present, the majority of regulated AI health technologies are diagnostic systems that classify disease presence/absence, disease subtype, and/or disease severity.10,24 Indeed, many of the areas where AI is expected to have the greatest impact involve automated diagnosis of disease in population screening programs, such as breast cancer screening and diabetic eye disease screening.25 Diagnostic accuracy studies of such technologies should adhere to the 2015 Standards for Reporting Diagnostic Accuracy Studies (STARD) guidelines in the absence of AI-specific reporting guidelines.26 At the time of writing, an AI-specific extension, STARD-AI, is under development.3
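For orientation, a minimal sketch of the headline metrics a STARD-style diagnostic accuracy study reports, computed from a hypothetical 2×2 cross-tabulation against the reference standard, with 95% Wilson score confidence intervals (all counts invented for illustration):

```python
from math import sqrt

def wilson_ci(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (centre - margin, centre + margin)

# Hypothetical counts against the reference standard.
tp, fn = 86, 14   # diseased cases: AI positive / AI negative
tn, fp = 880, 20  # non-diseased cases: AI negative / AI positive

sens, sens_ci = tp / (tp + fn), wilson_ci(tp, tp + fn)
spec, spec_ci = tn / (tn + fp), wilson_ci(tn, tn + fp)
print(f"sensitivity {sens:.2f} (95% CI {sens_ci[0]:.2f}-{sens_ci[1]:.2f})")
print(f"specificity {spec:.2f} (95% CI {spec_ci[0]:.2f}-{spec_ci[1]:.2f})")
```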
Clinical prediction models estimate the likelihood of an individual having or developing disease using predictor variables (risk factors such as age, sex, and biomarkers). The ability of AI to analyze large and complex datasets of predictor variables has led to the development of several potential AI prediction models, such as AI health technologies for predicting sepsis.27 The widely accepted EQUATOR reporting guideline for prediction and prognostic model studies is the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD 2015) guideline,28 which also has a recently published AI-specific extension, TRIPOD+AI.8
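Two properties that TRIPOD-style reports center on are discrimination (how well the model ranks events above non-events) and calibration (whether predicted risks match observed event rates). A minimal sketch with synthetic data and hypothetical model outputs, using scikit-learn:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical validation set: true labels and a model's predicted risks.
y_true = rng.binomial(1, 0.2, size=1000)
y_prob = np.clip(y_true * 0.35 + rng.uniform(0, 0.6, size=1000), 0, 1)

# Discrimination: area under the ROC curve.
print(f"AUROC: {roc_auc_score(y_true, y_prob):.2f}")

# Calibration: compare mean predicted risk with observed event rate per bin.
obs, pred = calibration_curve(y_true, y_prob, n_bins=5)
for o, p in zip(obs, pred):
    print(f"predicted {p:.2f} -> observed {o:.2f}")
```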
Reporting Guidelines for Early Interventional Studies of AI Health Technologies
The way in which AI technologies are designed to interact with humans, and the learning curves associated with this, are critical for successful implementation. Complex decision-support tools, including those based on AI, may have unpredictable human-interaction properties warranting specific investigation.29 There is an argument that early, small-scale feasibility studies are necessary after initial validation of algorithmic performance and prior to launching into large, expensive prospective trials. To support this stage of evaluation, the Developmental and Exploratory Clinical Investigation of DEcision-support systems driven by Artificial Intelligence (DECIDE-AI) guideline was published in 2022 and makes new recommendations for how such studies should be reported.4
Reporting Guidelines for RCTs of AI Health Technologies
RCTs are considered the gold-standard experimental design in the hierarchy of evidence, providing the most rigorous assessment of preventative, diagnostic, and therapeutic interventions. Standard Protocol Items: Recommendations for Interventional Trials (SPIRIT) and CONSORT are the accepted reporting standards for randomized trials and their protocols, endorsed by the International Committee of Medical Journal Editors.22,30 Extensions to these two guidelines, SPIRIT-AI and CONSORT-AI, were published in 2020.5,6 These extensions include 15 and 14 new items, respectively, considered minimum standards of reporting for AI health technologies, in addition to the core items outlined by SPIRIT and CONSORT. AI-specific recommendations include, but are not limited to, the following (a minimal illustrative sketch follows the list):
  • Description of the type and versions of the AI model and its intended use
  • Access and restrictions to access or reuse of the AI model and/or its code
  • How the AI system was integrated in trial sites
  • Inclusion and exclusion criteria at the level of participants and input data
  • Assessment and handling of poor-quality or unavailable input data
  • Any human–AI interaction elements
  • The output of the AI intervention and its impact on decision making or other elements of clinical practice
  • Analysis of performance errors of the system
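The sketch below loosely illustrates how a trial team might operationalize two of these items, reporting the exact model version used and pre-specified handling of poor-quality input data, by logging a structured record per case. The model, quality threshold, and routing policy are entirely hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MODEL_VERSION = "1.2.0"  # hypothetical; CONSORT-AI asks for the exact version used

@dataclass
class AIResult:
    """One per-case record, logged so performance error analyses can be reported later."""
    model_version: str
    usable_input: bool
    prediction: Optional[str]
    note: str

def run_ai(image_quality: float, raw_score: float) -> AIResult:
    # Pre-specified handling of poor-quality input data: ungradable images are
    # excluded from AI analysis and routed to a human grader.
    if image_quality < 0.5:
        return AIResult(MODEL_VERSION, False, None,
                        "image ungradable; case routed to human grader")
    label = "refer" if raw_score >= 0.7 else "no refer"
    return AIResult(MODEL_VERSION, True, label, "AI output shown to clinician")

print(run_ai(image_quality=0.3, raw_score=0.9))
print(run_ai(image_quality=0.8, raw_score=0.9))
```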
Other Relevant Reporting Guidelines
An area of increasing importance to the evaluation of AI health technologies is health economics. The relevant EQUATOR guideline extension is the Consolidated Health Economic Evaluation Reporting Standards for interventions that use artificial intelligence (CHEERS-AI).7 The mechanisms by which AI health technologies should be reimbursed, and how their economic value should be defined and measured, are still evolving, but evidence of economic value remains a key consideration for decision makers.31,32
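A minimal sketch of the incremental cost-effectiveness ratio (ICER) that such health-economic evaluations typically hinge on; all figures are hypothetical and the comparison is deliberately simplified:

```python
def icer(cost_new: float, cost_old: float,
         qaly_new: float, qaly_old: float) -> float:
    """Incremental cost per quality-adjusted life-year (QALY) gained."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)

# Hypothetical per-patient costs and QALYs for AI-supported vs. standard screening.
value = icer(cost_new=420.0, cost_old=380.0, qaly_new=8.05, qaly_old=8.00)
print(f"ICER: {value:,.0f} per QALY gained")  # compared against a willingness-to-pay threshold
```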
As AI health technologies become more widely adopted, observational studies based on real-world evidence will emerge as valuable forms of evidence for decision makers. These studies should apply the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guidelines.33 Another, more practice-oriented guideline that can be used alongside others is the 2014 Template for Intervention Description and Replication (TIDieR).16 It demands more practical disclosures from healthcare studies, including the assumed social and organizational basis for the effect of an intervention, a characterization of the actors involved in its implementation, and any adaptations made to the intervention or to the context in which it was applied. These details can help decision makers to assess the feasibility of an AI intervention in their own healthcare context.
Appraising Clinical Studies
The design, conduct, and reporting of clinical studies represent evidence production, whereas their appraisal represents the evaluation of that evidence with regard to quality, relevance, and other factors. Decision makers can then synthesize the available evidence alongside other sources of information relating to need and context (including input from patients, professionals, and other stakeholders) to make better, evidence-informed choices about AI-enabled healthcare interventions. The ability to appraise clinical evidence is also valuable for potential end users of AI health technologies, who need appropriate trust in the performance of any technology they may adopt in their own practice. By gaining this authentic and independent sense of the strengths and limitations of a technology, end users can support better procurement decisions within their organizations, be more confident about the limitations of use (e.g., when to challenge a “decision”), and be better informed when communicating with patients about the device and the interpretation of its outputs.
Reporting guidelines can be used as frameworks to guide appraisal, but their primary goal is to define what should be reported (focused on transparency) rather than how a study should be done (focused on quality and minimizing bias). For potential end users of AI health technologies, there is a complementary range of appraisal tools whose primary goal is to support systematic quality appraisal of evidence.34 These tools were developed mainly for evidence synthesis research but are highly relevant to the evaluation of evidence in isolation and are accessible to potential end users of AI health technologies without deep experience of research or commissioning. The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool identifies key considerations for diagnostic accuracy studies, and an AI-specific extension (QUADAS-AI) is under development.35 Similarly, the Prediction model Risk Of Bias ASsessment Tool (PROBAST) identifies key considerations for prognostic models and also has an AI-specific extension (PROBAST-AI) under development.36 There are also resources that propose heuristics to aid the evaluation of studies of AI-enabled healthcare interventions across all study designs.34,37,38
When appraising clinical evidence of an AI health technology, potential end users should first consider the intervention being described. Understanding the operating principle of the tool itself, alongside its inputs and outputs, allows a basic check of the viability of an AI-enabled intervention in a specific clinical context. The method used to train the AI technology may also offer insight; for example, training a model using a federated learning approach can address information governance barriers to multi-institutional datasets, helping to produce more robust models with generalizable performance.39 Understanding the data on which an AI technology was trained will also permit an appraisal of its likely performance in a specific population and of any subgroups for whom it may underperform.40 
The capacity and expertise of real-world end users of the AI technology should be compared to those of the individuals who used it in the clinical study. This becomes particularly important when appraising AI-enabled interventions where the AI has an assistive rather than automating function. For many use cases, the diagnostic performance of AI-assisted clinicians is greater than that of either the clinician or the AI in isolation.41 This performance differential is sensitive to the timing of clinician–AI interaction in the decision-making process and to the degree of explainability presented to the clinician. However, the benefits of an assistive use case over an automated one are offset by the instability it introduces to intervention performance (requiring long-term monitoring by adopters), the variability in risk between users and over time within the same user, and the missed opportunity to reallocate clinician time to other tasks.11,13,29 To assess the balance of these costs and benefits for a particular AI-enabled intervention, choices made by researchers about how the AI interacts with clinicians in a study should be justified. This justification would ideally be based on usability methods that test different AI–clinician interaction protocols but should at least consider the practical demands of the disease and care pathway that the intervention targets. 
The practical demands of the digital infrastructure and data flows required for a given AI-enabled intervention are also relevant for decision makers. Flexibility from AI manufacturers can help here. For example, a product that can run on software installed locally at the healthcare provider institution or remotely on a secure cloud platform can be expected to align with many different institutions’ needs and priorities.12
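As a small illustration of the subgroup appraisal described above, the following sketch tabulates accuracy by a demographic attribute using toy, entirely hypothetical per-case validation results; a large gap between subgroups, or a very small subgroup sample, would be a flag for appraisers:

```python
import pandas as pd

# Hypothetical per-case validation results with a demographic attribute.
df = pd.DataFrame({
    "ethnicity": ["A", "A", "A", "B", "B", "B", "B", "A"],
    "y_true":    [1,   0,   1,   1,   0,   1,   0,   0],
    "y_pred":    [1,   0,   1,   0,   0,   0,   1,   0],
})

df["correct"] = df["y_true"] == df["y_pred"]
# Accuracy and case counts per subgroup.
print(df.groupby("ethnicity")["correct"].agg(accuracy="mean", n="size"))
```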
Conclusions
If the emergence of AI in health care is to improve care, decision makers must see through the surrounding hype and invest in the minority of AI-enabled interventions that offer evidence-based benefits. To do this, effective evaluation is critical. The evaluation of AI health technologies is unfamiliar territory for most decision makers in health care, but they already hold most of the relevant expertise. It is important to recognize that the evaluation of these technologies builds on existing, well-established methodologies and that evaluators need not be overawed by the technological sophistication of the interventions under consideration. The distinct features (and risks) of AI health technologies are outlined in the emerging AI-specific guidelines, with the EQUATOR Network providing reporting guidelines tailored to specific study designs. Given the sensitivity of AI health technologies to their implementation context, it is also crucial that decision makers ensure they are provided with evidence that addresses these issues. There is a challenge here for the community to invest in later phase studies, such as RCTs and other interventional studies, that provide such evidence, so that the AI health technologies that reach patients are effective, safe, and equitable.
Acknowledgments
H.D.J. Hogg is fully funded by a Doctoral Fellowship from the National Institute for Health and Care Research (NIHR301467).
This is independent work carried out at the National Institute for Health and Care Research (NIHR) Birmingham Biomedical Research Centre (BRC). The views expressed are those of the author(s) and not necessarily those of the NIHR or the Department of Health and Social Care. 
Disclosure: H.D.J. Hogg, None; A.P.L. Martindale, None; X. Liu, None; A.K. Denniston, None 
References
1. Hogg HDJ, Al-Zubaidy M, Technology Enhanced Macular Services Reference Group, et al. Stakeholder perspectives of clinical artificial intelligence implementation: systematic review of qualitative evidence. J Med Internet Res. 2023; 25: e39742.
2. Norgeot B, Quer G, Beaulieu-Jones BK, et al. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020; 26(9): 1320–1324.
3. Sounderajah V, Ashrafian H, Golub RM, et al. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol. BMJ Open. 2021; 11(6): e047709.
4. Vasey B, Nagendran M, Campbell B, et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat Med. 2022; 28(5): 924–933.
5. Cruz Rivera S, Liu X, Chan A-W, Denniston AK, Calvert MJ, SPIRIT-AI and CONSORT-AI Working Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. BMJ. 2020; 370: m3210.
6. Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK, SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. BMJ. 2020; 370: m3164.
7. Elvidge J, Hawksworth C, Avşar TS, et al. Consolidated Health Economic Evaluation Reporting Standards for interventions that use artificial intelligence (CHEERS-AI) [published online ahead of print May 23, 2024]. Value Health. https://doi.org/10.1016/j.jval.2024.05.006.
8. Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024; 385: e078378.
9. American National Standards Institute. ANSI/CTA-2089.1-2020: Definitions/characteristics of artificial intelligence in health care. Available at: https://webstore.ansi.org/standards/ansi/ansicta20892020. Accessed July 8, 2024.
10. Zhang J, Whebell S, Gallifant J, et al. An interactive dashboard to track themes, development maturity, and global equity in clinical artificial intelligence research. Lancet Digit Health. 2022; 4(4): e212–e213.
11. Kim JY, Boag W, Gulamali F, et al. Organizational governance of emerging technologies: AI adoption in healthcare. arXiv. https://doi.org/10.48550/arXiv.2304.13081.
12. Greenhalgh T, Wherton J, Papoutsi C, et al. Beyond adoption: a new framework for theorizing and evaluating nonadoption, abandonment, and challenges to the scale-up, spread, and sustainability of health and care technologies. J Med Internet Res. 2017; 19(11): e367.
13. Martindale APL, Ng B, Ngai V, et al. Concordance of randomised controlled trials for artificial intelligence interventions with the CONSORT-AI reporting guidelines. Nat Commun. 2024; 15(1): 1619.
14. U.S. Food & Drug Administration. Artificial intelligence and machine learning (AI/ML)-enabled medical devices. Available at: https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices. Accessed July 8, 2024.
15. Wu E, Wu K, Daneshjou R, Ouyang D, Ho DE, Zou J. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat Med. 2021; 27(4): 582–584.
16. Hoffmann TC, Glasziou PP, Boutron I, et al. Better reporting of interventions: template for intervention description and replication (TIDieR) checklist and guide. BMJ. 2014; 348: g1687.
17. Hogg H, Al-Zubaidy M, Keane PA, Hughes G, Beyer FR, Maniatopoulos G. Evaluating the translation of implementation science to clinical artificial intelligence: a bibliometric study of qualitative research. Front Health Serv. 2023; 3: 1161822.
18. Pinnock H, Barwick M, Carpenter CR, et al. Standards for Reporting Implementation Studies (StaRI) statement. BMJ. 2017; 356: i6795.
19. Skivington K, Matthews L, Simpson SA, et al. A new framework for developing and evaluating complex interventions: update of Medical Research Council guidance. BMJ. 2021; 374: n2061.
20. National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. Washington, DC: National Academies Press; 2019.
21. EQUATOR Network. Enhancing the QUAlity and Transparency Of health Research. Available at: https://www.equator-network.org/. Accessed July 8, 2024.
22. Schulz KF, Altman DG, Moher D. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ. 2010; 340: c332.
23. Lu JH, Callahan A, Patel BS, et al. Assessment of adherence to reporting guidelines by commonly used clinical prediction models from a single vendor: a systematic review. JAMA Netw Open. 2022; 5(8): e2227779.
24. Lyell D, Coiera E, Chen J, Shah P, Magrabi F. How machine learning is embedded to support clinician decision making: an analysis of FDA-approved medical devices. BMJ Health Care Inform. 2021; 28(1): e100301.
25. Heydon P, Egan C, Bolter L, et al. Prospective evaluation of an artificial intelligence-enabled algorithm for automated diabetic retinopathy screening of 30 000 patients. Br J Ophthalmol. 2021; 105(5): 723–728.
26. Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015; 351: h5527.
27. Henry KE, Kornfield R, Sridharan A, et al. Human–machine teaming is key to AI adoption: clinicians’ experiences with a deployed machine learning system. NPJ Digit Med. 2022; 5(1): 97.
28. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015; 350: g7594.
29. Lebovitz S, Lifshitz-Assaf H, Levina N. To engage or not to engage with AI for critical judgments: how professionals deal with opacity when using AI for medical diagnosis. Organ Sci. 2022; 33(1): 126–148.
30. Chan AW, Tetzlaff JM, Altman DG, et al. SPIRIT 2013 statement: defining standard protocol items for clinical trials. Ann Intern Med. 2013; 158(3): 200–207.
31. Parikh RB, Helmchen LA. Paying for artificial intelligence in medicine. NPJ Digit Med. 2022; 5(1): 63.
32. Hendrix N, Veenstra DL, Cheng M, Anderson NC, Verguet S. Assessing the economic value of clinical artificial intelligence: challenges and opportunities. Value Health. 2022; 25(3): 331–339.
33. von Elm E, Altman DG, Egger M, et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. BMJ. 2007; 335(7624): 806–808.
34. Buccheri RK, Sharifi C. Critical appraisal tools and reporting guidelines for evidence-based practice. Worldviews Evid Based Nurs. 2017; 14(6): 463–472.
35. Whiting PF, Rutjes AWS, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011; 155(8): 529–536.
36. Collins GS, Dhiman P, Navarro CLA, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021; 11(7): e048008.
37. Faes L, Liu X, Wagner SK, et al. A clinician's guide to artificial intelligence: how to critically appraise machine learning studies. Transl Vis Sci Technol. 2020; 9(2): 7.
38. Liu Y, Chen P-HC, Krause J, Peng L. How to read articles that use machine learning: users’ guides to the medical literature. JAMA. 2019; 322(18): 1806–1816.
39. Rieke N, Hancox J, Li W, et al. The future of digital health with federated learning. NPJ Digit Med. 2020; 3: 119.
40. Ganapathi S, Palmer J, Alderman JE, et al. Tackling bias in AI health datasets through the STANDING Together initiative. Nat Med. 2022; 28(11): 2232–2233.
41. Cabitza F, Campagner A, Ronzio L, et al. Rams, hounds and white boxes: investigating human–AI collaboration protocols in medical diagnosis. Artif Intell Med. 2023; 138: 102506.