Benchmarking missing-values approaches for predictive models on health databases
Alexandre Perez-Lebel, Gaël Varoquaux, Marine Le Morvan, Julie Josse, Jean-Baptiste Poline
Missing Values
Machine Learning
Supervised Learning
Benchmark
Imputation
Multiple Imputation
Bagging
Abstract
BACKGROUND
As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative --rather than generative-- modeling, and thus open the door to new missing-values strategies. Yet existing empirical evaluations of strategies to handle missing values have focused on inferential statistics.
RESULTS
Here we conduct a systematic benchmark of missing-values strategies in predictive models with a focus on large health databases: four electronic health record datasets, a population brain imaging one, a health survey and two intensive care ones. Using gradient-boosted trees, we compare native support for missing values with simple and state-of-the-art imputation prior to learning. We investigate prediction accuracy and computational time. For prediction after imputation, we find that adding an indicator to express which values have been imputed is important, suggesting that the data are missing not at random. Elaborate missing-values imputation can improve prediction compared to simple strategies, but requires longer computational time on large data. Learning trees that model missing values --with missing incorporated attribute-- leads to robust, fast, and well-performing predictive modeling.
CONCLUSIONS
Native support for missing values in supervised machine learning predicts better than state-of-the-art imputation with much less computational cost. When using imputation, it is important to add indicator columns expressing which values have been imputed.
Steps
Introduction
This protocol details the experiments run in the GigaScience article _Benchmarking missing-values approaches for predictive models on health databases_, Perez-Lebel et al. 2022. The code used for running the experiments and plotting the results is available on GitHub:
Software
| Label | Value |
|---|---|
| Name | Benchmarking missing-values approaches for predictive models... |
| Repository | https://github.com/alexprz/article-benchmark_mv_approaches |
| Developer | Alexandre Perez-Lebel |
The software can be installed through the following steps:
```
#Install
git clone https://github.com/aperezlebel/benchmark_mv_approaches.git
cd benchmark_mv_approaches
conda install --file requirements.txt
```
Data
We benchmarked 12 supervised predictive methods on 13 prediction tasks taken from 4 health databases.
Each one of the 4 databases needs to be downloaded separately from its respective source project. Access to Traumabase, UK BioBank and MIMIC-III requires an application; NHIS is freely available. Once downloaded, the data path of each database can be updated in the TB.py, UKBB.py, MIMIC.py and NHIS.py files, which are in the database/ folder of the project.
- Traumabase: data can be obtained by contacting the team on the Traumabase website.
- UK BioBank: the data are available upon application, as detailed on the UK BioBank website.
- MIMIC-III: the data can be accessed via an application described on the MIMIC website. Note that, as of the time of writing, completing an online MIT course is required for the application. We used version 1.4 of the data in this project.
Prediction tasks
From these databases, we defined 13 prediction tasks. Each task consists of a set of input features and an outcome to predict; all features of a task belong to the same database.
Available tasks can be obtained with:
```
#Available tasks
python main.py info available -t
```

Names of the available tasks are:

- TB/death_pvals
- TB/platelet_pvals
- TB/hemo
- TB/hemo_pvals
- TB/septic_pvals
- UKBB/breast_25
- UKBB/breast_pvals
- UKBB/skin_pvals
- UKBB/parkinson_pvals
- UKBB/fluid_pvals
- MIMIC/septic_pvals
- MIMIC/hemo_pvals
- NHIS/income_pvals
Predictive methods
36 predictive methods are available. The list of their IDs and names can be obtained by running:
```
#Available models
python main.py info available -m
```

IDs and names of the available methods are:

- 0: Classification
- 1: Classification_Logit
- 2: Regression
- 3: Regression_Ridge
- 4: Classification_imputed_Mean
- 5: Classification_Logit_imputed_Mean
- 6: Regression_imputed_Mean
- 7: Regression_Ridge_imputed_Mean
- 8: Classification_imputed_Mean+mask
- 9: Classification_Logit_imputed_Mean+mask
- 10: Regression_imputed_Mean+mask
- 11: Regression_Ridge_imputed_Mean+mask
- 12: Classification_imputed_Med
- 13: Classification_Logit_imputed_Med
- 14: Regression_imputed_Med
- 15: Regression_Ridge_imputed_Med
- 16: Classification_imputed_Med+mask
- 17: Classification_Logit_imputed_Med+mask
- 18: Regression_imputed_Med+mask
- 19: Regression_Ridge_imputed_Med+mask
- 20: Classification_imputed_Iterative
- 21: Classification_Logit_imputed_Iterative
- 22: Regression_imputed_Iterative
- 23: Regression_Ridge_imputed_Iterative
- 24: Classification_imputed_Iterative+mask
- 25: Classification_Logit_imputed_Iterative+mask
- 26: Regression_imputed_Iterative+mask
- 27: Regression_Ridge_imputed_Iterative+mask
- 28: Classification_imputed_KNN
- 29: Classification_Logit_imputed_KNN
- 30: Regression_imputed_KNN
- 31: Regression_Ridge_imputed_KNN
- 32: Classification_imputed_KNN+mask
- 33: Classification_Logit_imputed_KNN+mask
- 34: Regression_imputed_KNN+mask
- 35: Regression_Ridge_imputed_KNN+mask

_Classification_ and _Regression_ correspond respectively to HistGradientBoostingClassifier and HistGradientBoostingRegressor from scikit-learn. _Classification_Logit_ and _Regression_Ridge_ correspond respectively to the linear models Logit and Ridge used in the supplementary experiment. To each of these 4 base names can be appended the name of an imputer (e.g. _imputed_Mean, _imputed_Med, ...), with or without the mask (e.g. _imputed_Mean, _imputed_Mean+mask, ...). Whether to use bagging can be specified later, as explained in the _Prediction_ section of this protocol.
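For illustration, the sketch below shows how a method such as _Classification_imputed_Mean+mask_ can be assembled with scikit-learn, and how _Classification_ alone relies on the native missing-value support of gradient-boosted trees. It is a minimal example on synthetic data, assuming a recent scikit-learn; it is not the repository's own code.

```
# Illustrative sketch only (synthetic data), not the repository's implementation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
X[rng.uniform(size=X.shape) < 0.2] = np.nan  # inject missing values
y = rng.randint(0, 2, size=200)

# "Classification_imputed_Mean+mask": mean imputation plus the binary
# missingness indicator (add_indicator=True), followed by gradient boosting.
clf = make_pipeline(
    SimpleImputer(strategy="mean", add_indicator=True),
    HistGradientBoostingClassifier(),
)
clf.fit(X, y)

# "Classification": the trees handle NaN natively, no imputation step.
clf_native = HistGradientBoostingClassifier().fit(X, y)
```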
Feature selection
11 tasks have their features automatically selected with a simple ANOVA-based univariate test of the link of each feature to the outcome (task names end with "_pvals" in the code and "_screening" in the article).
The 2 remaining tasks have their features manually defined following the choices of experts in prior studies.
ANOVA-based feature selection
Categorical features are first one-hot encoded. Then, the ANOVA-based univariate test is performed on one third of the samples, which are then discarded. We keep the 100 encoded features with the smallest p-values. Once the features are selected, the cross-validated prediction is performed on the remaining two thirds of the samples.
For these tasks, there are 5 trials: in each trial, the samples on which the selection test is performed are redrawn, and the prediction is fitted anew on the new remaining samples and the newly selected features.
We used f_classif and f_regression from the feature_selection module of scikit-learn.
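A minimal sketch of this screening step on synthetic classification data (variable names are ours, not the project's code):

```
# Illustrative sketch of the ANOVA-based screening (not the repository's code).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_classif

rng = np.random.RandomState(0)
X = rng.normal(size=(3000, 500))   # stands in for one-hot encoded features
y = rng.randint(0, 2, size=3000)

# One third of the samples is used for the selection test, then discarded.
X_select, X_rest, y_select, y_rest = train_test_split(
    X, y, train_size=1 / 3, random_state=0
)

_, p_values = f_classif(X_select, y_select)
selected = np.argsort(p_values)[:100]      # 100 smallest p-values

# The cross-validated prediction is then run on the remaining two thirds,
# restricted to the selected features.
X_task, y_task = X_rest[:, selected], y_rest
```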
For each of these tasks, p-values of the test can be computed for each trial by running:
```
#Feature selection
python main.py select {task_name} --T {T}
```

Be careful to replace the placeholders {task_name} and {T} with the task name and the trial ID (0 to 4), respectively.
Example:
```
#Example of feature selection
python main.py select TB/death_pvals --T 0
```
**Note:** these commands will fail without the data and without the types of the features.
Manual selection following experts
Features for the hemorrhagic shock prediction (task named TB/hemo) in the Traumabase database are defined following Jiang et al.
Prediction
Scale
To study the influence of the dataset size on the results, we worked with 4 sizes of training set: 2 500, 10 000, 25 000 and 100 000 samples. For each of these sizes, the following operations are run.
Nested cross-validations
Two nested cross-validations are used. The outer one yields 5 training and test sets. The training set has 2 500, 10 000, 25 000 or 100 000 samples depending on the scale. The test set is composed of all the remaining samples. Note that the size of the test set is considerably larger with a train set of 2 500 samples than with 100 000. On each training set, we perform a cross-validated hyper-parameter search –the inner cross-validation– and select the best hyper-parameters. We evaluate the best model on the respective test set. We assess the quality of the prediction with a coefficient of determination for regressions and the area under the ROC curve for classification. We average the scores obtained on the 5 test sets of the outer cross-validation to give the final score.
The test set size is at least 10% of the size of the training set. If a prediction task does not have enough samples once the feature selection is performed (e.g. 110 000 samples for the 100 000 scale), it is skipped for the corresponding scale. As a result, the biggest scale has fewer available tasks than the smallest one (4 against 13, respectively).
To draw the 5 folds, we used StratifiedShuffleSplit (resp. ShuffleSplit) from scikit-learn for classifications (resp. regressions). We used GridSearchCV from scikit-learn to perform the cross-validated hyper-parameter tuning.
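The following sketch illustrates the outer/inner cross-validation logic on synthetic classification data. The hyper-parameter grid and variable names are illustrative only, not those used in the benchmark.

```
# Illustrative sketch of the nested cross-validation (not the repository's code).
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = rng.normal(size=(3000, 20))
y = rng.randint(0, 2, size=3000)

n_train = 2500  # one of the 4 scales: 2 500, 10 000, 25 000, 100 000
outer_cv = StratifiedShuffleSplit(n_splits=5, train_size=n_train, random_state=0)

scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    # Inner cross-validation: hyper-parameter search on the training set only.
    inner = GridSearchCV(
        HistGradientBoostingClassifier(),
        param_grid={"learning_rate": [0.05, 0.1, 0.3]},  # illustrative grid
        scoring="roc_auc",
    )
    inner.fit(X[train_idx], y[train_idx])
    y_score = inner.predict_proba(X[test_idx])[:, 1]
    scores.append(roc_auc_score(y[test_idx], y_score))

print(np.mean(scores))  # final score: average over the 5 outer test sets
```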
Evaluating a method on a prediction task is done by running:
```
#Prediction
python main.py predict {task_name} {method_id} --T {T}
```

Be careful to replace the placeholders {task_name}, {method_id} and {T} with the task name, the ID or name of the method, and the trial ID (0 to 4), respectively.
Example:
```
#Prediction example
python main.py predict TB/death_pvals 0 --T 0
```
Some methods of the benchmark use bagging. To add bagging to an available method, specify the number of estimators you want in the ensemble with the _--nbagging_ option. For instance:
```
#Prediction with bagging
python main.py predict TB/death_pvals 0 --T 0 --nbagging 100
```
**Note:** these commands will fail without the data and without the types of the features.
Results are dumped in the _results/_ folder.
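Conceptually, the bagging option wraps the whole imputation-plus-prediction pipeline in a scikit-learn bagging ensemble, so that each bootstrap replicate fits its own imputation. The sketch below only illustrates this idea on synthetic data, with an illustrative choice of imputer and a small ensemble; the repository's own implementation may differ.

```
# Conceptual sketch of the bagging variant (not the repository's implementation).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import BaggingClassifier, HistGradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 10))
X[rng.uniform(size=X.shape) < 0.2] = np.nan
y = rng.randint(0, 2, size=500)

base = make_pipeline(
    IterativeImputer(add_indicator=True, random_state=0),
    HistGradientBoostingClassifier(),
)
# Analogous in spirit to `--nbagging 100` (kept small here to run quickly).
bagged = BaggingClassifier(base, n_estimators=10, random_state=0)
bagged.fit(X, y)
```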
To run the full benchmark of the article, we needed 520 000 CPU hours.
Imputation
5 imputation methods are available:
- Imputation with the mean.
- Imputation with the median.
- Iterative imputation.
- Imputation with the nearest neighbors.
- Multiple Imputation using Bagging.

For each of them, new binary features can be added to the data: this binary mask encodes whether a value was originally missing or not.
The imputer is fitted on the train set only, and both the train and test sets are then imputed with the fitted imputer. Doing so avoids leaking information from the test set into the training procedure, and thus prevents an overly optimistic evaluation.
We used SimpleImputer, IterativeImputer, KNNImputer, BaggingClassifier and BaggingRegressor from scikit-learn.
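As an illustration of this train-only fitting and of the optional mask, here is a minimal scikit-learn sketch on synthetic data; the variable names and the choice of imputer are ours, not the project's code.

```
# Illustrative sketch: fit the imputer on the train set only, then transform both.
# add_indicator=True appends the binary missingness mask as extra columns.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
X[rng.uniform(size=X.shape) < 0.3] = np.nan
y = rng.randint(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)                    # fit on the train set only
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)  # no information flows from the test set
```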
Results
Once the results of all the methods are obtained, they are gathered in a single CSV file using the following command:
```
#Aggregate results
python main.py aggregate --root results/
```

This creates a _scores.csv_ file in the _scores/_ folder.
**The aggregated results obtained during our experiment are [given in our repository](https://github.com/aperezlebel/benchmark_mv_approaches/blob/2ed30c0ffffa93f0398731b11b9202523c4da96f/scores/merged_scores.csv).** This allows one to reproduce the figures and tables and to analyze the results further without needing the original data.
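To explore these aggregated results, the CSV can, for instance, be loaded with pandas. This is only a minimal sketch; no particular column layout is assumed here.

```
# Minimal sketch: load the aggregated scores shipped with the repository.
import pandas as pd

scores = pd.read_csv("scores/merged_scores.csv")
print(scores.shape)
print(scores.columns.tolist())  # inspect the available columns
print(scores.head())
```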
Figures and tables
Most of the figures and tables of our article can easily be reproduced without the original data (based on saved results only). Use the commands defined in the Makefile to reproduce figures and tables. (Note that some commands will fail because they require data or raw results that are not available in the repository.)
All figures and tables are saved in the graphics/ folder of the repository.
The main figure can be reproduced with:
```
#Main figure
python main.py figs boxplot
```

or

```
#Main figure
make boxplot
```
![Main figure part 1: Comparison of prediction performance across the 12 methods for 13 prediction tasks spread over 4 databases, and for 4 sizes of dataset (2 500, 10 000, 25 000 and 100 000 samples).](https://content.protocols.io/files/gb9pbgh87.jpg)

![Main figure part 2: Comparison of training times across the 12 methods for 13 prediction tasks spread over 4 databases, and for 4 sizes of dataset (2 500, 10 000, 25 000 and 100 000 samples).](https://content.protocols.io/files/gb9tbgh87.jpg)