Chapter 3 Why we should not resort to proxy solutions to impute heavily skewed count data
Previous missing data research discourages the use of imputation models that do not have a good fit to the distribution of the empirical data at hand (Kleinke, 2017; Yu et al., 2007). When data are zero-inflated counts (which are typically heavily skewed), an appropriate imputation model (and analysis model) would be a zero-inflation model or a hurdle model (Hilbe, 2011; Lambert, 1992; Mullahy, 1986). Apart from using adequate count data imputation models – which until recently were not yet available, proxy strategies to impute these kinds of data include:
- to ignore the fact that count data are (often skewed) non-negative integer values, and to use standard procedures for example based on OLS regression,
- to apply some normalizing transformation like a log or square root transformation, followed by normal model multiple imputation and backtransformation to the original scale afterwards, or
- to treat the data as ordered categorical and use an ordinal logistic regression imputation model.
van Buuren (2012), Chapter 3.6.1 furthermore recommends the following strategies to impute count data:
- predictive mean matching
- ordered categorical regression
- (zero-inflated) Poisson regression
- (zero-inflated) negative binomial regression.
The transformation – imputation – backtransformation approach has been implemented in Schafer’s norm
for Windows software from 1999 (which is unfortunately no longer available). Back then, norm
was one of the very few user friendly software solutions available. Transforming non-normal variables before the imputation to make the normality assumption of the imputation model more plausible, was one of the few options practitioners had at that time to handle missing data in non-normal data. Today, missing data researchers (e.g. von Hippel, 2013) usually discourage the use of transformations to make the normality assumption of an imputation model more plausible. Usually, better alternatives are available. Regarding ordered categorical regression, we are currently not aware of any systematic evaluations.
In this section, we would like to discuss these recommendations from 2012 in the light of current missing data research, and want to corroborate our points by a Monte Carlo simulation. van Buuren (2012) writes that it is not ``yet clear whether one of the four methods consistently outperforms the others. Until more detailed results become available, my advice is to use predictive mean matching for count data’’ (p. 78).
Firstly, we absolutely agree that more systematic simulation studies are needed before we can provide sound practical guidelines regarding which method is the best for a given scenario.
Furthermore, we also agree that predictive mean matching is a good allround method that can be recommended for many scenarios including count data that are not too severely skewed. Kleinke (2017), for example, who has simulated incomplete Poisson distributed data, has shown that PMM yielded acceptable results when skewness was only mild to moderate and when not too many data had to be imputed. However, inferences were biased, when count data were more heavily skewed and the missing data percentage was substantial.
Since zero-inflated data are usually heavily skewed, we assume that PMM might not be an option for the imputation of zero-inflated count data.
Secondly, as already mentioned, we are not aware of systematic simulations that have tested how appropriate ordered categorical imputation is for zero-inflated Poisson or negative binomial data. Generally speaking, (ordered) categorical imputation could be feasible, when the number of categories is not too large (otherwise, one might run into estimation problems / empty cell problems) (van Buuren & Groothuis-Oudshoorn, 2011). van Buuren & Groothuis-Oudshoorn (2011) mention that imputation of variables with more than 20 categories are often problematic. Therefore, if the range of the count variable lies between 0 and 20, using mice.impute.polr
to create the imputations might be an option. In our simulation, we wanted to test, if this is an appropriate strategy for zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) data.
Thirdly, Kleinke & Reinecke (2013) have demonstrated that MI based on a ZIP or ZINB imputation model produced acceptable results when the data were in fact zero-inflated Poisson or zero-inflated negative binomial respectively. However,we did not compare ZIP and ZINB imputation against predictive mean matching or ordered categorical imputation.
We therefore want
- to explore, if PMM can be safely used for the imputation of ZIP and ZINB data,
- to evaluate the appropriateness of ordered categorical imputation in these scenarios, and
- to compare the results of these strategies against results based on ZIP and ZINB imputation models.
To these ends, we re-analyse the data from Kleinke & Reinecke (2013).
3.1 Method
Details about the simulation set-up are given in Kleinke & Reinecke (2013). Simulated data sets contained four variables. \(y\) was the dependent incomplete zero-inflated count variable, which – depending on the respective condition – was either ZIP or ZINB. Data sets also included three continuous predictors: \(x_1\), \(x_2\), and \(z_1\), with \(x_1\) and \(x_2\) being the covariates in the count part of the model, and \(z\) the covariate in the zero-inflation part. Population parameters were set to \(\beta_0=1\), \(\beta_1=.3\), \(\beta_2=.3\), \(\gamma_0=0\), \(\gamma_1=2\), with \(\beta\) being the parameters in the count part and \(\gamma\) the parameter in the zero part of the model. In the ZINB conditions, the dispersion parameter was set to 1. For both the ZIP and ZINB conditions 1000 data sets with sample sizes \(N\in\{500,1000,10000\}\) were simulated respectively. MAR missingness in \(y\) was introduced as outlined in Kleinke & Reinecke (2013). Missing data were imputed using functions mice.impute.pmm
(predictive mean matching) and mice.impute.polr
(ordered polytomous regression). Repeated data analysis and pooling of results was done as described in Kleinke & Reinecke (2013).
3.2 Results
PMM results are given in Tables 3.1 and 3.2. The tables display the average estimate of the respective parameter \(\widehat{Q}\) across the 1000 replications, its standard deviation, bias, which is defined as the average absolute distance between the simulated true parameter and its estimate across the 1000 replications, and 95% coverage rates, the percentage of intervals that include the true parameter. Obviously, \(\widehat{Q}\) should be close to true parameters \(Q\). Furthermore, the standard deviation of the estimates across the replications should be small (which in combination with an accurate \(\widehat{Q}\) reflects a consistently good estimation across the replications). Finally, 95% coverage should be close to 95%. Schafer & Graham (2002) deem values below 90% as serious undercoverage.
N | Parameter | \(\widehat{Q}\) | \(SD_{\widehat{Q}}\) | Bias | Coverage Rate |
---|---|---|---|---|---|
500 | \(\beta_0\) | 0.944 | 0.062 | 0.056 | 87.4 |
500 | \(\beta_1\) | 0.240 | 0.052 | 0.060 | 79.2 |
500 | \(\beta_2\) | 0.249 | 0.051 | 0.051 | 86.7 |
500 | \(\gamma_0\) | -0.183 | 0.172 | 0.183 | 86.8 |
500 | \(\gamma_1\) | 1.719 | 0.235 | 0.281 | 79.1 |
1000 | \(\beta_0\) | 0.947 | 0.042 | 0.053 | 80.9 |
1000 | \(\beta_1\) | 0.243 | 0.037 | 0.057 | 67.2 |
1000 | \(\beta_2\) | 0.252 | 0.036 | 0.048 | 77.8 |
1000 | \(\gamma_0\) | -0.185 | 0.121 | 0.185 | 74.6 |
1000 | \(\gamma_1\) | 1.707 | 0.165 | 0.293 | 63.3 |
10000 | \(\beta_0\) | 0.946 | 0.014 | 0.054 | 2.8 |
10000 | \(\beta_1\) | 0.244 | 0.011 | 0.056 | 0.7 |
10000 | \(\beta_2\) | 0.250 | 0.011 | 0.050 | 2.3 |
10000 | \(\gamma_0\) | -0.191 | 0.038 | 0.191 | 0.3 |
10000 | \(\gamma_1\) | 1.699 | 0.052 | 0.302 | 0.2 |
N | Parameter | \(\widehat{Q}\) | \(SD_{\widehat{Q}}\) | Bias | Coverage Rate |
---|---|---|---|---|---|
500 | \(\beta_0\) | 0.938 | 0.121 | 0.062 | 92.6 |
500 | \(\beta_1\) | 0.265 | 0.096 | 0.035 | 93.8 |
500 | \(\beta_2\) | 0.271 | 0.099 | 0.029 | 93.9 |
500 | \(\gamma_0\) | -0.195 | 0.300 | 0.195 | 96.0 |
500 | \(\gamma_1\) | 1.729 | 0.326 | 0.271 | 84.9 |
1000 | \(\beta_0\) | 0.939 | 0.085 | 0.061 | 91.5 |
1000 | \(\beta_1\) | 0.266 | 0.068 | 0.034 | 92.2 |
1000 | \(\beta_2\) | 0.270 | 0.067 | 0.030 | 95.2 |
1000 | \(\gamma_0\) | -0.194 | 0.211 | 0.194 | 92.0 |
1000 | \(\gamma_1\) | 1.709 | 0.220 | 0.291 | 77.7 |
10000 | \(\beta_0\) | 0.939 | 0.026 | 0.061 | 42.1 |
10000 | \(\beta_1\) | 0.269 | 0.022 | 0.031 | 69.8 |
10000 | \(\beta_2\) | 0.271 | 0.021 | 0.029 | 75.8 |
10000 | \(\gamma_0\) | -0.196 | 0.065 | 0.196 | 20.6 |
10000 | \(\gamma_1\) | 1.694 | 0.066 | 0.306 | 3.0 |
When we first look at results of the ZIP conditions, we see that estimates \(\widehat{Q}\) are very similar – regardless of sample size. However, both parameters in the zero model and in the count model are underestimated.
Furthermore, note that coverage rates decrease dramatically with increasing sample size. This is because standard error estimates depend on sample size. An increasing sample size leads to a smaller standard error estimate. Confidence intervals therefore become narrower, and even samller biases could result in significant undercoverage (i.e. less intervals include the true parameter).
Let us turn to the ZINB conditions: Here, biases are especially noticeable in the zero part of the model. Though parameters of the count part of the model are also underestimated (especially \(\beta_1\)), corresponding coverage rates are acceptably large, when \(N=500\) or \(N=1000\). In the large sample size condition, coverage, however, falls below 90% for all parameters, and, in the worst case, droppes down to 3% for parameter \(\gamma_1\).
Results of polytomous regression (function mice.impute.polr
) are given in Tables 3.3 and 3.4. Coefficients of both the zero and the count part of the model are usually underestimated and coverage rates of most model parameters are usually unacceptably low regardless of sample size.
We conclude that – in this scenario – where data were zero-inflated Poisson or negative binonial and severely skewed, both predictive mean matching and ordered polytomous regression produced inadequate results and biased statistical inferences. When we compare results to the ones reported in Kleinke & Reinecke (2013), we see that using an imputation model with fitting distributional assumptions is highly recommendable.
N | Parameter | \(\widehat{Q}\) | \(SD_{\widehat{Q}}\) | Bias | Coverage Rate |
---|---|---|---|---|---|
500 | \(\beta_0\) | 0.997 | 0.055 | 0.003 | 95.0 |
500 | \(\beta_1\) | 0.284 | 0.046 | 0.016 | 94.6 |
500 | \(\beta_2\) | 0.276 | 0.046 | 0.024 | 93.3 |
500 | \(\gamma_0\) | -0.024 | 0.162 | 0.024 | 94.4 |
500 | \(\gamma_1\) | 1.943 | 0.252 | 0.057 | 93.9 |
1000 | \(\beta_0\) | 0.997 | 0.037 | 0.003 | 95.0 |
1000 | \(\beta_1\) | 0.284 | 0.034 | 0.016 | 92.5 |
1000 | \(\beta_2\) | 0.281 | 0.032 | 0.019 | 92.6 |
1000 | \(\gamma_0\) | -0.016 | 0.115 | 0.016 | 94.4 |
1000 | \(\gamma_1\) | 1.932 | 0.175 | 0.068 | 91.5 |
10000 | \(\beta_0\) | 0.996 | 0.013 | 0.004 | 92.4 |
10000 | \(\beta_1\) | 0.283 | 0.010 | 0.017 | 63.3 |
10000 | \(\beta_2\) | 0.284 | 0.011 | 0.016 | 70.0 |
10000 | \(\gamma_0\) | -0.004 | 0.036 | 0.004 | 94.4 |
10000 | \(\gamma_1\) | 1.921 | 0.053 | 0.079 | 70.7 |
N | Parameter | \(\widehat{Q}\) | \(SD_{\widehat{Q}}\) | Bias | Coverage Rate |
---|---|---|---|---|---|
500 | \(\beta_0\) | 1.038 | 0.116 | -0.038 | 90.5 |
500 | \(\beta_1\) | 0.299 | 0.093 | 0.001 | 94.7 |
500 | \(\beta_2\) | 0.276 | 0.092 | 0.024 | 93.8 |
500 | \(\gamma_0\) | 0.079 | 0.261 | -0.079 | 90.5 |
500 | \(\gamma_1\) | 1.781 | 0.289 | 0.219 | 85.8 |
1000 | \(\beta_0\) | 1.030 | 0.083 | -0.030 | 91.6 |
1000 | \(\beta_1\) | 0.296 | 0.066 | 0.004 | 94.2 |
1000 | \(\beta_2\) | 0.281 | 0.064 | 0.019 | 93.6 |
1000 | \(\gamma_0\) | 0.080 | 0.184 | -0.080 | 91.3 |
1000 | \(\gamma_1\) | 1.773 | 0.201 | 0.227 | 82.4 |
10000 | \(\beta_0\) | 1.022 | 0.026 | -0.022 | 85.8 |
10000 | \(\beta_1\) | 0.292 | 0.021 | 0.008 | 92.3 |
10000 | \(\beta_2\) | 0.291 | 0.021 | 0.009 | 93.1 |
10000 | \(\gamma_0\) | 0.077 | 0.056 | -0.077 | 71.6 |
10000 | \(\gamma_1\) | 1.782 | 0.062 | 0.218 | 16.4 |
3.3 Summary and discussion
Kleinke & Reinecke (2013) used an imputation model with fitting distributional assumptions (i.e. ZIP imputation for ZIP data and ZINB imputation for ZINB data) and obtained unbiased statistical inferences. Here, we used the same data and imputed them using proxy methods like predictive mean matching or (ordered) polytomous regression. These strategies, however, yielded biased statistical infereces.
Results of this simulation once again stress the need to find an appropriate imputation model that has a sufficiently good fit to the data at hand. If the imputation model (like the normal linear regression model used by mice.impute.pmm
) is overly implausible, results will be biased.
Limitations and caveats
This study was based on simulated data, which were ZIP and ZINB distributed. Note that a ZIP or ZINB model will never have a perfect fit to real empirical data. Empirical researchers need to bear in mind that the quality of imputations will depend on how well one’s empirical data can be modeled by mathematically convenient models (like in this case a ZIP or ZINB model). Usually, the worse the model fit between empirical data and the assumed imputation model is, the less plausible imputations will be and the more bias is to be expected.