Population aging and the rising incidence and prevalence of chronic conditions increase the risk of hospitalization or death. This risk is particularly high for patients with multimorbidity, who consume a large share of health care resources. Identifying high-risk patients as early as possible is therefore an important challenge for improving health care service provision and reducing costs. Population health management, based on intelligent models, can be used to assess risk and identify these “complex” patients. The aim of this study is to validate machine learning algorithms (Naïve Bayes, CART, C5.0, Conditional Inference Tree, Random Forest, Artificial Neural Network and LASSO) for predicting the risk of hospitalization or death from administrative and socio-economic data. The study involved the residents of the Local Health Unit of Central Tuscany.
The input data for the algorithms came from the mARSupio database of the Agenzia Regionale Sanità (ARS) in Florence, Tuscany, Italy. Patients’ privacy is protected in this database: personal data are hidden and each patient is identified only by the IDUNI (a unique 24-character identification code). Data in mARSupio are collected from the main information flows of the Tuscany Regional Health Services and from the national ISTAT census of the resident population: data from hospitals (i.e. diagnoses and procedures), from outpatient services (i.e. assistive, diagnostic and rehabilitation services), from pharmacies (e.g. prescribed drugs), data on exemptions (for either income or diagnosis) and data from the last census (2011). These administrative flows are usually almost complete because the Regional Health System covers the expenses subject to code reporting.
The whole dataset was split into a training set (70%) and a test set (30%): the training set had 1,070,801 samples and the test set 458,913 samples. Since the initial dataset was heavily imbalanced towards the negative output class ‘G’ (the positive class ‘B’ occurred in less than 1.5% of the samples), the training set was balanced by undersampling: one random sample was kept out of every twenty belonging to the ‘G’ group.
At the end of this process, the training set was reduced to 67,978 rows, of which 22.36% were positive samples (TABLE I). The test set, by contrast, was left unmodified in order to evaluate performance on a real sample of the Tuscan population.
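The split and balancing procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `split_and_undersample`, its parameters and the fixed seed are hypothetical, and "one random sample every twenty" is interpreted here as keeping a random twentieth of the ‘G’ rows in the training set.

```python
import random


def split_and_undersample(labels, train_frac=0.7, keep_every=20, seed=0):
    """Return (train_idx, test_idx) as lists of row indices.

    labels: sequence of class labels, 'B' (positive) or 'G' (negative).
    Only the training portion is undersampled; the test portion keeps
    its original class distribution.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    idx = list(range(len(labels)))
    rng.shuffle(idx)

    # 70/30 random split of the whole dataset.
    cut = int(round(train_frac * len(idx)))
    train, test = idx[:cut], idx[cut:]

    # Keep every positive training row, but only 1 in `keep_every` negatives.
    positives = [i for i in train if labels[i] == "B"]
    negatives = [i for i in train if labels[i] == "G"]
    kept_negatives = rng.sample(negatives, len(negatives) // keep_every)

    balanced = positives + kept_negatives
    rng.shuffle(balanced)
    return balanced, test
```

Because the test indices are never resampled, metrics computed on them reflect the class prevalence of the original population rather than that of the balanced training set.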
Population health management can be very useful to identify the target patients. It is intended as a risk assessment process for defining patients’ cohorts and stratifying members by the risk of preventable hospitalizations in order to deliver specific treatment programs according to the individual needs, with the final aim of improving the health outcomes. Such a process is based on big data analysis techniques.
Several institutions and companies are studying and testing models to support general practitioners (GPs) in selecting patients for specific care programs or to predict the risk of hospitalization or death. The existing models follow different approaches, from statistics to machine learning, and use administrative and/or socio-economic data. Statistical models are the most widely used so far.
The first aim of this work is to assess the performance of several machine learning algorithms, the same ones used in the literature, in predicting avoidable hospital admissions or death and identifying the patients involved: Naïve Bayes, decision trees such as CART, C5.0 and Conditional Inference Tree, Random Forest, Artificial Neural Network and LASSO.
The second aim is to select the subset of the most important features for patient identification, in order to speed up the analysis of the population. The final goal is to develop a first-level screening tool that identifies high-risk patients, those whom GPs should monitor with specific treatment programs, in order to reduce the hospitalization rate and/or postpone death.
The population is getting older and the number of people suffering from multiple chronic conditions is increasing. For GPs and healthcare providers in general, it becomes crucial to identify complex patients as early as possible and treat them with specific care programs, in order to reduce or postpone hospitalizations or death. A possible solution to support this selection process is the development of population health management tools based on machine learning methods.
This paper presents the performance evaluation of several machine learning algorithms on the binary classification problem of identifying high-risk patients in the population, by analyzing different sources of administrative and socio-economic data. Among the tested algorithms, Random Forest and LASSO are the best models in terms of PPR and F1-Score. These models outperform the methods currently used in the Tuscany region for the identification of high-risk patients (6 vs 39 for the PPR metric). The main limitation of this approach is a fairly high number of false positives. This is not an issue, since these tools are intended for a first level of screening, and the resulting list of patients is expected to be further analyzed by the GPs to extract the final list of patients to be enrolled in dedicated treatment programs.
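To make the relationship between false positives and the F1-Score concrete, the following minimal sketch computes precision, recall and F1 from confusion-matrix counts using their standard definitions. The function name and the counts in the usage note are hypothetical and are not results from this study; the PPR metric is not reproduced here.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from confusion-matrix counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Returns 0.0 for any ratio whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, with illustrative counts of 60 true positives, 240 false positives and 40 false negatives, `precision_recall_f1(60, 240, 40)` gives a precision of 0.2 and a recall of 0.6: most truly high-risk patients are flagged, but four out of five flagged patients are false positives, which is the situation in which a second review by the GPs is useful.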