1. Introduction
Analysing Real-World Evidence (RWE) datasets in clinical research has increasing utility in overcoming some limitations of conventional Randomised Clinical Trials (RCT), such as target sourcing and early patient stratification. One of the biggest challenges when working with RWE datasets is that experimentation is limited and variables are often incomplete. Missing data appear as gaps in the dataset that hide values meaningful for analysis. As Little put it, "the best resolution for handling missing data is not to have missing data" [1]. However, analysing RWE datasets poses the challenge of handling missing values. Indeed, missing data are found not only in observational RWE datasets but also in RCT [2]. Using RWE effectively to advance current medical practice requires better solutions to the missing value problem.
In RWE datasets, there is a need to understand which biomarkers best correlate with clinical outcomes to facilitate drug development. A substantial difficulty in real-world settings is that the status of several biomarkers may be missing from the dataset, hiding meaningful information from the analysis. Hence, ignoring the underlying value of missing data may invalidate the results. Missing data have many practical implications: for example, they can lower statistical power, reduce the precision of confidence intervals for parameter estimates, and lead to biased estimates.
We study the scenario where the observed features, sometimes called covariates of interest, are possibly incomplete. Let us denote the features by X. The missingness mechanism M places the missing values in the incomplete dataset X_obs, masking the actual value of X, i.e., M is a random variable taking values in {0, 1}^d, so that:

X_obs,j = X_j if M_j = 0, and X_obs,j = na if M_j = 1.
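As a concrete sketch of this masking mechanism, the following minimal simulation applies an MCAR mask to a complete data matrix; the array shapes and the 30% missingness rate are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Complete data X: 5 samples, 3 features (sizes arbitrary)
X = rng.normal(size=(5, 3))

# Mask M: True marks a missing entry (an MCAR mask with rate 0.3 here)
M = rng.random(X.shape) < 0.3

# Observed data: actual values where not masked, NaN (the "na" token) where masked
X_obs = np.where(M, np.nan, X)
```

The mask and the data are drawn independently here, which is exactly the MCAR setting discussed below; MAR and MNAR mechanisms would instead make `M` depend on `X`.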
Rubin et al. [3] defined three possible scenarios for missing data: Missingness Completely at Random (MCAR), Missingness at Random (MAR), and Missingness Not at Random (MNAR). All types of missingness can usefully be classified into this taxonomy, which indicates the most appropriate procedure to minimise bias. We say that the data are MCAR if the complete dataset X does not influence the mask M, i.e., X is independent of M. In large sample theory, we may test for MCAR from the data by conducting Little's test [4], which compares the change in the empirical mean of the measured variables when removing cases with missing values. In MAR, other measured variables influence the missingness mechanism. The consequences of MAR are diverse but addressable by controlling for other known variables and by imputation. In analysing RWE datasets, it is helpful to assume MAR because it allows handling missing data with general-purpose imputation algorithms, the central topic of this article. The MNAR scenario is complicated: if present, it means that M may induce a statistical bias when estimating parameters with missing values. It is also impossible to test whether X influences its own missingness mechanism M with the given data [5]. Possible resolutions are to model the MNAR mechanism, collect more data on the missing variable, or collect information on other variables that allow modelling MNAR as a MAR scenario.
A conventional ad hoc method to handle missing data is complete case analysis: deleting any rows or columns with missing variables. The problem with complete case analysis is that it squanders information, reducing the sample size considerably, especially in RWE scenarios where incomplete cases may be frequent. Imputation algorithms are general strategies that replace missing values (na) with plausible ones. Nevertheless, replacing missing values with single static values cannot be fully representative of the missing sample; after all, imputed values are estimated, not observed. Therefore, it is often more appropriate to apply a random variable approach to represent missing values. Rubin et al. [6] proposed multiple imputation (MI) for survey non-responders to tackle the uncertainty in the missing values that single imputation cannot represent with a point estimate. The general idea of MI is to generate multiple complete datasets, analyse each dataset separately, and summarise the results. Imputation algorithms that perform MI must replace missing values with samples from the missing values' joint probability density function; the MI approach thus embraces the uncertainty in the missing values that a single point estimate cannot represent. Early papers proposing imputation algorithms for MI often apply conventional statistical methods to estimate the probability density function, such as expectation-maximisation (EM) [7]. More sophisticated methods, adapting ideas from Markov-Chain Monte-Carlo [8], dimension reduction [9], ensemble learning [9] and deep learning [10], have been proposed. However, there is a lack of literature on validation and systematic comparison of imputation methods [11]. Furthermore, the importance of considering missingness patterns and the data distribution when comparing methods has received little attention [12]. Neglecting to do so may lead to biased conclusions about the relative performance of imputation methods.
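The "analyse each dataset separately and summarise" step of MI is usually done with Rubin's combining rules, which pool the per-dataset estimates and account for both within- and between-imputation variance. A minimal sketch (the function name and toy numbers are ours, for illustration only):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m per-dataset estimates with Rubin's rules.

    estimates: length-m sequence of point estimates (one per imputed dataset)
    variances: length-m sequence of their squared standard errors
    Returns the pooled estimate and its total variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w_bar = variances.mean()            # within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    t = w_bar + (1 + 1 / m) * b         # total variance
    return q_bar, t

# Toy usage: five imputed datasets, each yielding an estimate and its variance
q, t = pool_rubin([1.0, 1.2, 0.9, 1.1, 1.0], [0.04, 0.05, 0.04, 0.05, 0.04])
```

The between-imputation term `b` is precisely the uncertainty that a single imputation would silently discard.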
The present paper develops a new multivariate imputation algorithm powered by a deep neural network (DNN) for tabular data named Tabnet [13] that performs multiple imputations and supports mixed data types. We define the causal mechanism of missing data and explain the rationale for considering imputation algorithms in RWE data examples. Following best practices in developing a new imputation algorithm [3], we aim to find an accurate imputation algorithm with good statistical properties, such as unbiased parameter estimates and coverage of the parameter estimates determined from sampling and missing-data variance. The RWE data source used in the present paper was the Non-Small Cell Lung Cancer (NSCLC) Flatiron database [14], a dataset of de-identified patient-level electronic medical records in the United States spanning 280 community practices and seven sizeable academic research institutions. The Flatiron NSCLC biomarker cohort potentially differs from other densely sampled datasets, where biomarkers are measured longitudinally. However, it is a typical example of the use case of RWE, where clinical interest is often in biomarkers tested at cancer diagnosis that help identify sub-populations that benefit most from targeted treatments. For NSCLC, clinical practice guideline recommendations include testing the genomic biomarkers epidermal growth factor receptor (EGFR), anaplastic lymphoma kinase (ALK), Kirsten rat sarcoma (KRAS), B-RAF proto-oncogene (BRAF), and the immunotherapy marker programmed death-ligand 1 (PDL1).
The present article makes several contributions that we summarise as follows:
We propose a systematic approach for comparing imputation methods on RWE datasets.
We apply this approach to compare the model performance of six imputation methods: expectation-maximisation (EM), predictive mean matching (PMM) with multivariate imputation by chained equations (MICE), bootstrap-based principal component analysis (MIPCA), a method that uses random forests (MIRF), generative adversarial imputation networks (GAIN), and a method that uses MICE with tabular networks (MITABNET).
We conduct a comparative study of the state-of-the-art imputation algorithms in simulations and RWE data benchmarks with clinical oncology applications.
Our research develops a new multivariate imputation algorithm powered by a deep neural network (DNN) for tabular data named Tabnet [13] that performs multiple imputations and supports mixed data types.
4. Discussion
Several methods have recently been proposed to perform multiple imputations with missing data for RWE observational datasets [10,11]. To our knowledge, few papers have systematically compared the statistical properties of the various methods while considering the impact of missing data and concept drift. To help researchers working on RWE datasets choose an imputation method, we have studied six methods that perform multiple imputations.
All algorithms draw multiple imputations using different but comparable techniques. PMM and MITABNET use a Gibbs sampler approach; MIRF uses different random seeds to initialise a random forest; EM and MIPCA use a bootstrap-based approach; GAIN uses generative adversarial networks. A deep neural network powers both GAIN and MITABNET, where multiple imputations may be drawn by applying dropout layers [28] at both training and imputation time. To our knowledge, we are the first to investigate the usefulness of Tabnet as an algorithm for multiple imputations (MITABNET) and to compare it systematically with state-of-the-art methods. MITABNET can become part of the covariate pre-processing step of RWE dataset analysis, combining the interpretation of the multiply imputed datasets for more robust inference.
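The dropout-based draw mechanism mentioned above can be sketched in a few lines: keeping the dropout mask active at prediction time makes each forward pass stochastic, so repeated passes yield distinct imputation draws. The toy weights and layer sizes below are arbitrary stand-ins, not our trained network:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy weights standing in for a trained imputation network (purely illustrative)
W1 = rng.normal(size=(3, 16))
W2 = rng.normal(size=(16, 1))

def predict_with_dropout(x, p=0.3):
    """One stochastic forward pass: the dropout mask stays active at
    prediction time, so repeated calls yield different imputation draws."""
    h = np.maximum(0.0, x @ W1)        # hidden layer with ReLU
    keep = rng.random(h.shape) > p     # dropout mask, redrawn every call
    h = h * keep / (1.0 - p)           # inverted-dropout rescaling
    return h @ W2

x = np.array([0.5, -1.0, 2.0])         # one record's (already encoded) features
draws = np.array([predict_with_dropout(x) for _ in range(5)])
```

The spread of `draws` plays the role of the between-imputation variance in the pooled analysis.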
In this article, we have focused on finding the best imputation method for realistically complex datasets. Our synthetic data experiment used a structural causal model to sample multivariate datasets with different levels of correlation among the observed features. As seen in Figure 6 and Table 3, all methods perform best when the correlation among variables is high. Our results agree with previous research showing that the best-case setting for applying off-the-shelf imputation algorithms is the MAR mechanism with a high correlation between variables. The synthetic data experiment found that MITABNET and GAIN outperformed every other algorithm in both high and low correlation settings in terms of accuracy and percentage bias. However, analysing the RWE NSCLC Flatiron dataset did not yield conclusive results about the best method when considering missingness impact and concept drift. Only three methods, MIRF, MITABNET and PMM, achieved a low percentage bias in the scenario where the concept drift favoured them, also showing a low percentage bias when the impact of missing data was <20%. As seen in Table 5, PMM achieved consistently acceptable coverage >50%, outperformed only under concept drift by MIRF and MITABNET. On the other hand, PMM also had the widest confidence intervals of all imputation algorithms.
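For clarity, the two evaluation metrics used throughout this discussion can be stated compactly; the function names and toy numbers below are ours, for illustration:

```python
import numpy as np

def percentage_bias(estimates, true_value):
    """Percentage bias of the mean of repeated estimates against the known parameter."""
    return 100 * abs(np.mean(estimates) - true_value) / abs(true_value)

def coverage(ci_lower, ci_upper, true_value):
    """Fraction of confidence intervals that contain the true parameter."""
    ci_lower, ci_upper = np.asarray(ci_lower), np.asarray(ci_upper)
    return np.mean((ci_lower <= true_value) & (true_value <= ci_upper))

# Toy usage over three simulated repetitions with true parameter 1.0
est = [0.98, 1.05, 1.02]
pb = percentage_bias(est, 1.0)                           # ~1.67% bias
cov = coverage([0.9, 0.8, 1.1], [1.1, 1.2, 1.3], 1.0)    # 2 of 3 intervals cover
```

Low percentage bias with low coverage (or wide intervals with high coverage, as for PMM) illustrates why both metrics must be read together.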
We analysed the bias and coverage of parameter estimates after imputing with several imputation algorithms, extending the approach for a standardised evaluation of imputation algorithms from [10,11], which concluded that MIRF or GAIN result in more accurate imputation and sharper inference than other imputation algorithms. Our synthetic data results indicate that tabular networks may outperform randomised decision trees and generative adversarial networks for both low and highly correlated datasets with the structural causal model used previously by [27], being less biased and hence preferred for sharper inference.
5. Limitations
Although we performed the present comparative study with realistically complex analyses and real-world data, it has limitations. The most critical limitation is that our results are dataset-dependent. A limitation of our imputation algorithm is that, to avoid an excessive computational burden, we only generated five multiply imputed datasets for each method, leading to potentially noisy between-imputation variability. For realistic analyses, [25] recommended estimating the number of imputations necessary to produce efficient estimates by conducting a relative efficiency analysis based on the fraction of missing information [29]. Nevertheless, the default choice in the most popular multiple imputation packages is five [8], and although we evaluated the convergence of the algorithms, it is possible that analysing RWE datasets needs more imputations to produce efficient estimates.
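The relative efficiency referred to here is commonly computed with Rubin's approximation RE = (1 + γ/m)^(-1), where γ is the fraction of missing information and m the number of imputations. A one-line sketch shows why m = 5 is usually considered adequate:

```python
def relative_efficiency(gamma, m):
    """Rubin's approximate relative efficiency of m imputations versus
    infinitely many, for a fraction of missing information gamma."""
    return 1 / (1 + gamma / m)

# With 30% missing information, m = 5 already attains ~94% efficiency,
# and raising m to 20 recovers only a few additional points
re_5 = relative_efficiency(0.3, 5)
re_20 = relative_efficiency(0.3, 20)
```

Efficiency, however, is not the same as stable between-imputation variance estimation, which is the concern raised above.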
Finally, our study focused on MAR missingness patterns. However, an MNAR missing data pattern may go unrecognised in practice, and results should be generalised with caution. Alternatives to pre-canned algorithms, such as full information maximum likelihood [30] and fully Bayesian imputation [5], where the model assumptions about the missing values are explicit in the model formulation, may be more appropriate for MNAR settings. However, full information maximum likelihood and fully Bayesian approaches require extra engineering steps to include the missing variables in the model and are beyond the scope of this analysis. Algorithms for multiple imputations such as MITABNET work well for MAR and remain the standard approach for handling missing data with imputation algorithms [8,31].