Paul Lee, Augustine Lee, Nader Moinfar, Mariangela Rivera, Rebecca Metzinger; Applied Machine Learning to Medicare Utilization Data. Invest. Ophthalmol. Vis. Sci. 2017;58(8):5075.
© ARVO (1962-2015); The Authors (2016-present)
To evaluate machine learning algorithms for classifying the factors that determine high utilization of Medicare services. Identifying such factors is important for optimizing quality and resource allocation. The confluence of open-source data and high-capacity computing has moved such analysis out of specialized computing environments.
Multiple CMS and US Census datasets were combined with clinical intuition to identify attributes that might be associated with Medicare utilization patterns: gender, years in practice, population of the providers’ zip codes, participation in PQRS, and participation in EHR. Attribute identification was limited to these nonclinical factors given the confines of the publicly available data sources.
From this data, providers performing intravitreal injections (CPT code 67028) were selected. Utilization data were normalized to reflect treatments per patient rather than raw treatment volume. The resulting values were then categorized into a high group (above the 50th percentile) and a low group (below the 50th percentile).
Linear and nonlinear classification algorithms were run in R using the caret package for model comparison. Linear/logistic regression, Naïve Bayes, Support Vector Machine (SVM), linear discriminant analysis (LDA), K-nearest neighbors (KNN), Random Forest, and Classification & Regression Trees (CART) algorithms were evaluated. Accuracy and kappa scores were used for comparison.
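The labeling step and the two comparison metrics described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the R/caret workflow the study used; the provider records, field names, and helper functions are hypothetical, and the handling of values that fall exactly on the median is an assumption, since the abstract does not specify it.

```python
from statistics import median

# Hypothetical provider records: injection services (67028) and unique patients.
providers = [
    {"id": "A", "services": 1200, "patients": 300},
    {"id": "B", "services": 450, "patients": 150},
    {"id": "C", "services": 900, "patients": 180},
    {"id": "D", "services": 200, "patients": 100},
]

def label_by_median(records):
    """Normalize to treatments per patient, then split at the 50th percentile."""
    rates = {r["id"]: r["services"] / r["patients"] for r in records}
    cutoff = median(rates.values())
    # Assumption: a rate exactly equal to the median is labeled "low".
    return {pid: ("high" if rate > cutoff else "low") for pid, rate in rates.items()}

def accuracy(actual, predicted):
    """Fraction of predictions that match the true label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def cohen_kappa(actual, predicted):
    """Agreement beyond chance: (observed - expected) / (1 - expected)."""
    n = len(actual)
    observed = accuracy(actual, predicted)
    classes = set(actual) | set(predicted)
    # Chance agreement from the marginal class frequencies of each labeling.
    expected = sum((actual.count(c) / n) * (predicted.count(c) / n) for c in classes)
    if expected == 1.0:
        return 0.0  # Kappa is undefined here; returning 0.0 is a choice of this sketch.
    return (observed - expected) / (1 - expected)

labels = label_by_median(providers)
```

Kappa complements accuracy in a median split because the two classes are balanced by construction: a classifier that guesses at random scores about 50% accuracy but a kappa near zero.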
Figure 1: Min, median, mean, and max of the accuracy and kappa scores.
As shown in Figure 1, K-nearest neighbors was chosen because it offered the best combination of accuracy and kappa values. Further refinement of the KNN model by increasing the number of neighbors up to 20 (in increments of 1) did not significantly improve the results.
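The neighbor sweep can be illustrated with a small, self-contained KNN classifier. This is a sketch under stated assumptions rather than the study's caret tuning procedure: the two-feature toy data are invented, leave-one-out accuracy stands in for whatever resampling scheme the authors used, and all function names are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    # On a tied vote, the class encountered first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]

def loo_accuracy(xs, ys, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, xs[i], k) == ys[i]
    return hits / len(xs)

def sweep_k(xs, ys, k_values):
    """Accuracy for each candidate number of neighbors."""
    return {k: loo_accuracy(xs, ys, k) for k in k_values}

# Invented, pre-scaled toy features (e.g. years in practice, zip code population).
xs = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.4), (0.4, 0.2),
      (1.0, 1.0), (0.9, 0.8), (0.8, 0.9), (0.7, 0.7), (0.9, 0.6), (0.6, 0.8)]
ys = ["low"] * 6 + ["high"] * 6

# Sweep k from 1 to 20 in increments of 1, mirroring the range in the text.
results = sweep_k(xs, ys, range(1, 21))
```

On a toy set this small, any k at or above the number of remaining points simply polls every neighbor, so only the small-k results are informative. With real provider data, features on very different scales (years versus population counts) would need standardizing before distances are computed.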
It is possible to predict some of the characteristics associated with high utilization using public data sources. Expansion with enhanced demographic data, as well as inclusion of clinical data, would strengthen the predictive ability of such techniques. It is important to note that only quantitative conclusions can be drawn, since the datasets lack any clinical data. Specifically, gender, years since graduation, population of the providers’ zip code, and participation in EHR/PQRS can predict high utilization with 68% accuracy under the parameters and limitations reported. The choice of model will be determined by the trade-off between bias and variance, depending on the need.
This is an abstract that was submitted for the 2017 ARVO Annual Meeting, held in Baltimore, MD, May 7-11, 2017.