How robust are multiple imputation based inferences in multilevel models? (Martin Spiess, Kristian Kleinke)
Abstract: Multilevel models are usually based on strong distributional assumptions, like the multivariate normal, for all the random components. On the other hand, distributional assumptions about the covariates are usually not made. If, however, the covariates are subject to missing values and multiple imputation is chosen as a method to compensate for missing values, then one has to formulate a model for the predictive distribution of these variables, given all other variables and parameters.
These models are prone to erroneous assumptions because the corresponding conditional distributions are usually not themselves of scientific interest. For example, one would probably have to model the conditional distribution of age given gender, scores from tests and questionnaires, and socio-demographic variables. To make things worse, even if the dependent variable in a regression model is conditionally normally distributed and a covariate is continuous, the reverse regression of the covariate on the dependent variable can only be a linear regression if both are jointly normally distributed. This implies that, if a covariate in a multilevel model is to be imputed, an imputation model that assumes the variable to be imputed is normally distributed given all observed variable values and parameters is, in general, misspecified. Since it may rarely be the case that all variables, parameters and errors are in fact normally distributed, imputations in applications are presumably often generated under misspecified models.
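The point about the reverse regression can be illustrated with a small numerical sketch. It is not taken from the simulation study; the sample size, the coefficients and the exponential covariate are assumptions chosen only for illustration. The outcome is generated from a normal linear model given the covariate, and the covariate is then regressed on the outcome with and without a quadratic term.

```python
import numpy as np


def reverse_regression_fit(x, y):
    """Regress x on y with a linear and a quadratic polynomial.

    Returns the coefficient of y**2 and the mean squared errors of both fits.
    """
    lin = np.polyfit(y, x, 1)
    quad = np.polyfit(y, x, 2)  # coefficients ordered from y**2 down to the intercept
    mse_lin = float(np.mean((x - np.polyval(lin, y)) ** 2))
    mse_quad = float(np.mean((x - np.polyval(quad, y)) ** 2))
    return float(quad[0]), mse_lin, mse_quad


rng = np.random.default_rng(0)
n = 100_000  # illustrative sample size (assumption)
noise = rng.normal(0.0, 1.0, n)

covariates = {
    "normal": rng.normal(1.0, 1.0, n),   # jointly normal case
    "skewed": rng.exponential(1.0, n),   # skewed covariate, same mean and variance
}

results = {}
for label, x in covariates.items():
    # y given x is normal and linear in x in both cases
    y = 1.0 + 2.0 * x + noise
    results[label] = reverse_regression_fit(x, y)
    curvature, mse_lin, mse_quad = results[label]
    print(f"{label}: quadratic coefficient={curvature:.4f}, "
          f"MSE linear={mse_lin:.4f}, MSE quadratic={mse_quad:.4f}")
```

With the normal covariate the quadratic term should be close to zero, whereas with the skewed covariate it should be clearly positive and reduce the error, which suggests that a linear normal imputation model for the covariate would not match the true conditional mean.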
Thus, in this paper we present the results of a simulation study in which we investigate the consequences of misspecified imputation models for inferences in multilevel models. In particular, we consider covariate distributions that differ in skewness and kurtosis, ignorable missing-data mechanisms that differ in their selectivity, and different sample sizes. Finally, the results are discussed in the light of the imputation methods available for multilevel models, and further requirements for more flexible imputation techniques are highlighted.
Quantile regression based multiple imputation (Kristian Kleinke, Jost Reinecke, Mark Stemmler, Friedrich Lösel)
Abstract: Research by Yu, Burton, and Rivero-Arias (2007) suggests that multiple imputation procedures should be avoided when their (distributional) assumptions do not fit the empirical data at hand.
Unfortunately, the standard multiple imputation (MI) solutions implemented in most statistical packages, for example fully parametric MI under the multivariate normal model (norm; Schafer, 1997) or semi-parametric predictive mean matching (pmm), make assumptions (normality, homoscedasticity) that are often unrealistic in real-life situations. Using inappropriate missing data techniques could lead to biased inferences. Kleinke (in press), for example, has demonstrated that norm and pmm fail when data are too heavily skewed.
Unfortunately, robust procedures that can cope with various aspects of “non-normality” and heterogeneity are typically not yet implemented in major statistical packages.
Therefore, one purpose of this paper is to familiarize practitioners with available robust MI options.
Secondly, evaluation studies that systematically tested the assumed robustness of these procedures are extremely scarce.
The present study helps to fill this gap. We compared the performance of various standard and robust MI procedures regarding their ability to yield unbiased parameter estimates and standard errors when the empirical data are skewed and heteroscedastic.
Performance was evaluated in a Monte Carlo simulation based on empirical data from the Erlangen-Nuremberg development and prevention project: we predicted physical punishment by fathers of elementary school children from socio-demographic variables, fathers’ aggressiveness, dysfunctional parent-child relations and various other parenting characteristics (cf. Haupt, Lösel, & Stemmler, 2014). Haupt et al. (2014) compared the results of standard OLS regression and more robust regression procedures and deemed quantile regression an appropriate method to analyse these data. Analogously, we assumed that creating multiple imputations under a quantile regression based imputation model would be more appropriate than using regression based MI solutions that rely on a normal homoscedastic model.
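As a reminder of the idea behind quantile regression, the following minimal sketch shows the check (pinball) loss in its simplest, intercept-only form, where minimising it recovers a sample quantile. It does not reproduce the models of Haupt et al. (2014); the function names and the simulated skewed data are assumptions made for illustration.

```python
import numpy as np


def check_loss(u, tau):
    """Check (pinball) loss: tau * u for u >= 0 and (tau - 1) * u for u < 0."""
    return np.where(u >= 0, tau * u, (tau - 1.0) * u)


def constant_quantile_fit(y, tau):
    """Intercept-only quantile fit: the data point with the smallest total check loss."""
    losses = [check_loss(y - c, tau).sum() for c in y]
    return float(y[int(np.argmin(losses))])


rng = np.random.default_rng(42)
y = rng.exponential(2.0, 501)  # skewed illustrative data (assumption), odd sample size

fitted_median = constant_quantile_fit(y, 0.5)
print(fitted_median, float(np.median(y)))
```

A full quantile regression replaces the constant by a linear predictor and minimises the same loss, so no normality or homoscedasticity assumption about the errors is needed.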
We simulated 1000 complete data sets (by sampling with replacement) from the original data, introduced MAR missingness in the dependent variable, and imputed the data using standard and robust MI procedures. Like Haupt et al. (2014), we analysed the complete data using quantile regression. Finally, the multiple imputation results were compared against the complete data results.
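The general shape of such a simulation loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the study: the stand-in data are synthetic (a gamma-distributed outcome whose scale depends on one covariate), the missingness model and all numeric settings are invented, a sample quantile replaces the quantile regression analysis, and the imputation step ignores parameter uncertainty, so it is not a proper MI procedure.

```python
import numpy as np


def make_stand_in_data(n, rng):
    """Synthetic stand-in for the original data: skewed, heteroscedastic, non-negative y."""
    x = rng.uniform(0.0, 1.0, n)
    y = rng.gamma(shape=1.0, scale=0.5 + 2.0 * x)
    return x, y


def mar_mask(x, rng):
    """MAR: the probability that y is missing depends only on the observed x."""
    p_miss = 1.0 / (1.0 + np.exp(-(-1.5 + 2.0 * x)))
    return rng.random(x.size) < p_miss


def impute_normal(x, y, miss, rng):
    """Stochastic regression imputation under a normal homoscedastic model.

    Simplification: the parameters are fixed at their estimates.
    """
    obs = ~miss
    X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
    beta, *_ = np.linalg.lstsq(X_obs, y[obs], rcond=None)
    resid = y[obs] - X_obs @ beta
    sigma = np.sqrt(resid @ resid / (obs.sum() - 2))
    y_imp = y.copy()
    y_imp[miss] = beta[0] + beta[1] * x[miss] + rng.normal(0.0, sigma, miss.sum())
    return y_imp


def simulate(n_rep=200, n=500, m=5, tau=0.1, seed=1):
    rng = np.random.default_rng(seed)
    x0, y0 = make_stand_in_data(n, rng)
    bias, neg_share = [], []
    for _ in range(n_rep):
        idx = rng.integers(0, n, n)            # sample with replacement
        x, y = x0[idx], y0[idx]                # one "complete" data set
        miss = mar_mask(x, rng)                # introduce MAR missingness in y
        completed = [impute_normal(x, y, miss, rng) for _ in range(m)]
        q_mi = np.mean([np.quantile(c, tau) for c in completed])
        bias.append(q_mi - np.quantile(y, tau))  # compare with the complete data result
        neg_share.append(np.mean([np.mean(c[miss] < 0) for c in completed]))
    return float(np.mean(bias)), float(np.mean(neg_share))


mean_bias, mean_neg_share = simulate()
print(f"mean bias of the {0.1:.1f} quantile: {mean_bias:.3f}")
print(f"share of negative imputed values: {mean_neg_share:.3f}")
```

In this toy setting the outcome cannot be negative, so any negative imputed value is a direct sign that the normal homoscedastic model distorts the distribution of the variable.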
Based on our findings, we would like to discourage the use of standard parametric MI procedures when empirical data are skewed and heteroscedastic, as imputations based on a normal homoscedastic model distort the original distribution of the variable quite noticeably and lead to biased statistical inferences. Unless the missing data percentage is negligibly small, we argue in favour of using MI procedures that adequately address the relevant aspects of non-normality and heterogeneity of the empirical data at hand.