Abstract
Purpose :
The purpose of this study was to understand the characteristics of top-cited ophthalmology articles by building machine learning models that predict top-cited articles from bibliometric and natural language processing features. We also investigated which features were most important for these predictions and evaluated model performance on the subset of glaucoma-related articles.
Methods :
Ophthalmology papers published between January 2000 and June 2020 were downloaded from Scopus along with their metadata, including publication year, title, abstract, journal, number of authors, page range, funding source, and author and index keywords. The text of titles and abstracts was lowercased and tokenized (split into constituent words), and individual keywords were represented as one-hot vectors for input into the models. Gradient boosting machine (GBM) predictive models were created to identify whether or not each paper would be in the top 25th percentile of citations. The model was evaluated on a held-out test set using the F1 score and the areas under the receiver operating characteristic curve (AUROC) and the precision-recall curve (AUPRC). The model was also evaluated on a subset of glaucoma-related articles, identified by glaucoma-related keywords. The relative importance of features in the GBM model was determined to find the most predictive features.
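The preprocessing described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's actual pipeline: the function names, the regular-expression tokenizer, the toy titles and citation counts, and the nearest-rank rule for the top-quartile cutoff are all assumptions made for the example.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map each distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def one_hot(text, vocabulary):
    """Binary vector: 1 where a vocabulary word occurs in the text, else 0."""
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in vocabulary:
            vector[vocabulary[tok]] = 1
    return vector


def top_quartile_labels(citations):
    """Label 1 for papers strictly above the 75th-percentile citation count.

    Assumption: nearest-rank percentile; the abstract does not state
    how the cutoff was computed or how ties were handled.
    """
    ranked = sorted(citations)
    threshold = ranked[math.ceil(0.75 * len(ranked)) - 1]
    return [1 if c > threshold else 0 for c in citations]


# Hypothetical toy data, not taken from the study.
titles = ["Glaucoma RNA study", "Retina imaging study"]
vocabulary = build_vocabulary(titles)
features = [one_hot(t, vocabulary) for t in titles]
labels = top_quartile_labels([1, 2, 3, 4, 5, 6, 7, 8])
```

In a setup like this, the binary keyword vectors would be concatenated with the bibliometric variables (publication year, page count, author count) before being passed to a gradient boosting classifier.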
Results :
The GBM model had an AUROC of 0.846, an AUPRC of 0.531, and an F1 score of 0.206 for all ophthalmology papers, and performed similarly on the glaucoma subset (AUROC 0.885). The most influential predictive factors were standard bibliometric variables, namely publication year (relative influence 133.209), paper length (74.064), and author count (31.152). Among the tokenized and scored keywords, the most influential were the index keyword "study", with a relative influence of 29.882; "ophthalmology", 22.177; and "RNA", 19.345.
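For readers unfamiliar with the metrics reported above, the sketch below shows how F1 and AUROC are computed from model outputs. It is illustrative only: the labels, scores, and the 0.5 decision threshold are hypothetical, and AUPRC is omitted for brevity. AUROC depends only on how the scores rank positives against negatives, whereas F1 depends on the chosen decision threshold, so the two can differ considerably on the same predictions.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities, not from the study.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.45, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

toy_auroc = auroc(y_true, scores)  # 7 of 8 positive-negative pairs ranked correctly
toy_f1 = f1_score(y_true, y_pred)  # precision 1.0, recall 0.5
```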
Conclusions :
This study found that natural language processing, especially with algorithmic scoring of specific keywords, is a useful tool for predicting citation count alongside standard bibliometric variables. The most effective predictors of citation count were bibliometric variables, followed by certain index keywords.
This is a 2021 ARVO Annual Meeting abstract.