Quantile regression based multiple imputation

Conference of the European Survey Research Association (ESRA), Lisbon, Portugal

Authors

Kristian Kleinke

Jost Reinecke

Mark Stemmler

Friedrich Lösel

Date

July 20, 2017

Abstract

Research by Yu, Burton, and Rivero-Arias (2007) suggests that multiple imputation procedures should be avoided when their distributional assumptions do not fit the empirical data at hand. Unfortunately, the standard multiple imputation (MI) solutions implemented in most statistical packages, for example fully parametric MI under the multivariate normal model (norm; Schafer, 1997) or semi-parametric predictive mean matching (pmm), make assumptions (normality, homoscedasticity) that are often unrealistic in real-life situations. Using inappropriate missing data techniques can lead to biased inferences. Kleinke (in press), for example, has demonstrated that norm and pmm fail when data are too heavily skewed. Robust procedures that can cope with various aspects of non-normality and heterogeneity are typically not yet implemented in major statistical packages. One purpose of this paper is therefore to familiarize practitioners with the available robust MI options. A second purpose is to help fill a gap: evaluation studies that systematically test the assumed robustness of these procedures are extremely scarce. We compared the performance of various standard and robust MI procedures regarding their ability to yield unbiased parameter estimates and standard errors when the empirical data are skewed and heteroscedastic. Performance was evaluated in a Monte Carlo simulation based on empirical data from the Erlangen-Nuremberg development and prevention project: we predicted physical punishment by fathers of elementary school children from socio-demographic variables, fathers’ aggressiveness, dysfunctional parent-child relations, and various other parenting characteristics (cf. Haupt, Lösel, & Stemmler, 2014). Haupt et al. (2014) compared the results of standard OLS regression with those of more robust regression procedures and deemed quantile regression an appropriate method to analyse these data.
Analogously, we assumed that creating multiple imputations under a quantile regression based imputation model would be more appropriate than using regression based MI solutions that rely on a normal homoscedastic model. We simulated 1000 complete data sets from the original data by sampling with replacement, introduced missing at random (MAR) missingness in the dependent variable, and imputed the data using standard and robust MI procedures. Like Haupt et al. (2014), we analysed the complete data using quantile regression. Finally, the multiple imputation results were compared against the complete data results. Based on our findings, we would like to discourage the use of standard parametric MI procedures when empirical data are skewed and heteroscedastic: imputations based on a normal homoscedastic model distort the original distribution of the variable quite noticeably and lead to biased statistical inferences. Unless the missing data percentage is negligibly small, we argue in favour of using MI procedures that adequately address the relevant aspects of non-normality and heterogeneity of the empirical data at hand.
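
The first two steps of the simulation design described above (drawing a complete data set by sampling with replacement, then introducing MAR missingness in the dependent variable) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' code: the toy data generator, the function names `resample` and `impose_mar`, and the logistic missingness model with its `intercept` and `slope` values are all hypothetical. The imputation and quantile regression steps are not shown.

```python
import math
import random


def resample(rows, rng):
    """Draw len(rows) rows with replacement: one simulated 'complete' data set."""
    return [rng.choice(rows) for _ in rows]


def impose_mar(rows, rng, intercept=-1.0, slope=1.0):
    """Delete y with a probability that depends only on the observed x.

    Because missingness depends on x and never on y itself, the mechanism
    is missing at random (MAR). The logistic form and its coefficients are
    hypothetical choices for this sketch.
    """
    out = []
    for x, y in rows:
        p_missing = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        out.append((x, None if rng.random() < p_missing else y))
    return out


rng = random.Random(2017)

# Hypothetical stand-in for the empirical data: y is right-skewed and its
# spread grows with x (heteroscedastic).
original = []
for _ in range(500):
    x = rng.gauss(0.0, 1.0)
    y = math.exp(0.5 * x) * rng.expovariate(1.0)
    original.append((x, y))

complete = resample(original, rng)      # one bootstrap replicate
incomplete = impose_mar(complete, rng)  # same rows, some y values deleted
n_missing = sum(1 for _, y in incomplete if y is None)
print(f"{n_missing} of {len(incomplete)} outcome values set to missing")
```

In a full replication, these two steps would be repeated for each simulated data set, each incomplete data set would be imputed several times, and the quantile regression estimates from the imputed data would be compared with those from the corresponding complete data.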