June 2021
Volume 62, Issue 8
Open Access
ARVO Annual Meeting Abstract  |   June 2021
Predicting High Impact Ophthalmology Articles Using Machine Learning and Natural Language Processing
Author Affiliations & Notes
  • Yash Karandikar
    Occidental College, Los Angeles, California, United States
  • Sophia Y Wang
    Stanford University, Stanford, California, United States
  • Footnotes
    Commercial Relationships   Yash Karandikar, None; Sophia Wang, None
  • Footnotes
    Support  None
Investigative Ophthalmology & Visual Science June 2021, Vol.62, 998. doi:

      Yash Karandikar, Sophia Y Wang; Predicting High Impact Ophthalmology Articles Using Machine Learning and Natural Language Processing. Invest. Ophthalmol. Vis. Sci. 2021;62(8):998.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

Purpose : We sought to characterize top-cited ophthalmology articles by building machine learning models that predict top-cited status from bibliometric and natural language processing features. We also identified the features most important for that prediction and evaluated model performance on the subset of glaucoma-related articles.

Methods : Ophthalmology papers published between January 2000 and June 2020 were downloaded from Scopus along with their metadata: publication year, title, abstract, journal, number of authors, page range, funding source, and author and index keywords. The text of titles and abstracts was lowercased and tokenized (split into constituent words), and individual keywords were represented as one-hot vectors for input into the models. Gradient boosted machine (GBM) models were trained to predict whether each paper would fall in the top quartile of citation counts. The model was evaluated on a held-out test set using the F1 score, the area under the receiver operating characteristic curve (AUROC), and the area under the precision-recall curve (AUPRC). It was also evaluated on the subset of glaucoma-related articles, identified by glaucoma-related keywords. The relative importance of each feature in the GBM model was computed to find the most predictive features.
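
The preprocessing described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code: the regular-expression tokenizer, the function names, the sample titles, and the tie-handling rule for the top-quartile label are all assumptions made for illustration.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric word tokens.

    Assumption: the abstract does not specify the tokenizer; a simple
    regular expression stands in for it here.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents: List[str]) -> Dict[str, int]:
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def one_hot(text: str, vocabulary: Dict[str, int]) -> List[int]:
    """Return a binary vector: 1 where a vocabulary token occurs in the text."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        index = vocabulary.get(tok)
        if index is not None:  # tokens outside the vocabulary are ignored
            vector[index] = 1
    return vector


def top_quartile_labels(citations: List[int]) -> List[int]:
    """Label a paper 1 if its citation count reaches the top-quartile cutoff.

    Assumption: the cutoff is the count of the paper ranked n // 4 from the
    top, and ties at the cutoff are all labeled 1.
    """
    k = max(1, len(citations) // 4)
    cutoff = sorted(citations, reverse=True)[k - 1]
    return [1 if c >= cutoff else 0 for c in citations]


# Hypothetical sample data, not taken from the study.
titles = ["Glaucoma Progression Study", "RNA expression in the retina"]
vocabulary = build_vocabulary(titles)
features = one_hot("A study of glaucoma", vocabulary)
labels = top_quartile_labels([1, 50, 3, 7, 100, 2, 9, 4])
```

Here `features` marks only the columns for "glaucoma" and "study", and `labels` flags the two most-cited of the eight hypothetical papers. Vectors like these, joined with the bibliometric fields, would form the input to a gradient boosting classifier.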

Results : The GBM model achieved an AUROC of 0.846, an AUPRC of 0.531, and an F1 score of 0.206 on all ophthalmology papers, with similar performance on the glaucoma subset (AUROC 0.885). The most influential predictors were standard bibliometric variables, namely publication year (133.209), paper length (74.064), and author count (31.152). Among the tokenized and scored keywords, the most influential were the index keyword "study", with a relative influence of 29.882; "ophthalmology", 22.177; and "RNA", 19.345.
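
The reported metrics measure different things: AUROC summarizes how well scores rank top-cited papers above the rest, whereas F1 depends on the decision threshold applied to those scores. The abstract does not state the threshold used, so the following Python sketch only illustrates the general relationship. The labels, scores, thresholds, and function names are hypothetical.

```python
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auroc(y_true: List[int], scores: List[float]) -> float:
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half (pairwise definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predict(scores: List[float], threshold: float) -> List[int]:
    """Turn scores into binary predictions at the given threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical labels and model scores, not taken from the study.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.35, 0.45, 0.3, 0.2, 0.1, 0.05]

ranking_quality = auroc(y_true, scores)                          # threshold-free
strict = precision_recall_f1(y_true, predict(scores, 0.5))       # few positives predicted
lenient = precision_recall_f1(y_true, predict(scores, 0.3))      # more positives predicted
```

In this toy data the AUROC is fixed at 13/15, while F1 moves from 0.5 at the stricter threshold to 0.75 at the more lenient one. A model can therefore rank papers well and still have a modest F1 at a particular cutoff.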

Conclusions : Natural language processing, especially algorithmic scoring of specific keywords, is a useful complement to standard bibliometric variables for predicting citation count. The most effective predictors were bibliometric variables, followed by certain index keywords.

This is a 2021 ARVO Annual Meeting abstract.
