countimp version 2

Author

Kristian Kleinke and Jost Reinecke

Published

June 27, 2019

Modified

January 29, 2025

Abstract
Count data are non-negative integer values and give the frequency of occurrence of a certain event or behavior within a given timespan. Count data are usually not normally distributed but are often skewed and require special analysis and imputation techniques. Yet, most of the currently available multiple imputation packages are very limited with regard to count data. The countimp package provides easy-to-use multiple imputation (MI) procedures for incomplete count data based on either a Bayesian regression approach or a bootstrap regression approach within a chained equations MI framework. Our software extends the functionality of the popular and powerful mice package in R (van Buuren & Groothuis-Oudshoorn, 2011). The current version of countimp supports ordinary count data imputation under the Poisson model, imputation of incomplete overdispersed count data under either the quasi-Poisson or the negative binomial model, imputation of zero-inflated ordinary or overdispersed count data based on a zero-inflated Poisson or negative binomial model or a hurdle model, and imputation of multilevel count data based on generalized linear mixed effects count models (overdispersion and zero-inflation are supported). Additionally, we provide a predictive mean matching (PMM) variant based on a two-level model, which might be used when count data are not too heavily skewed (for an evaluation of flat-file count data imputation by PMM, see for example Kleinke, 2017).

1 Overview

Count data require special analysis techniques (for a comprehensive overview, see Hilbe, 2011): Ordinary count data can be analyzed by a classical Poisson regression model. Overdispersed count data, i.e., count data whose variance is larger than their mean, are usually modeled by either quasi-Poisson or negative binomial (NB) models (for a distinction between true and apparent overdispersion, see Hilbe, 2011). Zero-inflated count data, i.e., count data that contain more zero counts than would be expected under either the Poisson, quasi-Poisson or NB model, can be analyzed by either zero-inflated Poisson or NB models, or by hurdle Poisson or NB models (see Zeileis et al., 2008 for an overview). Furthermore, generalized linear mixed effects versions of these models can be fit to cater for a clustered structure of the data.

Missing data methods and especially multiple imputation procedures for count data, however, are currently still very scarce. In practice, the missing data problem in count variables is often handled by (a) ignoring that the data are counts and proceeding as if they were continuous, or (b) treating the data as (ordered) categorical. We regard these solutions as quick fixes (that may work in some settings), but not as an optimal and general solution to adequately analyze incomplete count data and to get precise and unbiased parameter estimates as well as standard errors in a wide variety of scenarios. See for example Yu et al. (2007) or Kleinke (2017) for situations where imputation models based on the normal model fail. Furthermore, see Section 3 for an evaluation of the strategy to treat count data as if they were ordered categorical.

Based on research by Yu et al. (2007) and Kleinke (2017), we recommend using imputation methods whose parametric assumptions fit the data.

Currently, the following packages (apart from countimp) can handle missing data based on various count models:

  • IVEware, which is available as a SAS add-on or as a standalone version, can create imputations based on an ordinary Poisson model.
  • R package mi supports flat-file count data imputation. Zero-inflation models, hurdle models, and multilevel count models are not supported.
  • ice for STATA supports count data imputation under a Poisson or NB model.

This document is structured as follows:

  • Section 2 gives information about missing data and multiple imputation in general, and familiarizes readers with typical count data models.
  • In Section 3, we illustrate by means of Monte Carlo simulation, why it may not be a good idea to treat counts as if they were either continuous or (ordered) categorical.
  • In Section 4, we describe package countimp in detail and provide practical examples as well as Monte Carlo simulations to assess both the quality of the newly introduced and the refurbished algorithms from Version 2 of package countimp.

An evaluation of the algorithms from Version 1 of package countimp may furthermore be found in:

  • Kleinke & Reinecke (2013)
  • Kleinke & Reinecke (2015a)
  • Kleinke & Reinecke (2015b)

What is new in version 2?

  • We have included support for both hurdle and zero-inflation models for both flat-file and multilevel data sets (only two-level models are currently supported).
  • We have replaced package glmmADMB with glmmTMB to fit the two-level NB, zero-inflation, and hurdle models, which in our experience runs more stably.
  • We have adapted our functions to the new mice architecture (countimp version 1 only works with mice version 2.35 or older).
  • We have included automatic handling of outliers in the imputed values (argument EV=TRUE). Note that extreme imputations might indicate that there are problems with the imputation method, model, and/or the algorithm. If extreme imputations are detected, the respective imputations are replaced by imputations obtained by mice.impute.midastouch(), a predictive mean matching (PMM) variant described in Gaffert et al. (2016). PMM is usually a good allround imputation method that works in many scenarios (van Buuren, 2012).
  • We have included a two-level predictive mean matching algorithm. PMM can be counted among the hot deck or k-nearest neighbour procedures and imputes an actually observed value. Based on predictions by a normal two-level linear mixed effects model, an observed case (donor) is matched to an incomplete case, and the donor's observed value replaces the missing one. By imputing only observed values, PMM is able (up to a certain degree) to preserve the original distribution of the variable at hand and can buffer mild to moderate violations of model assumptions. On the downside, PMM cannot extrapolate beyond the observed range of the remaining cases and might not be a good method in situations where the distribution of the observed cases and the assumed distribution of the unobserved cases are highly distinct. A description of the method, as well as results of a first Monte Carlo simulation may be obtained here.
  • All imputation functions are available as Bayesian regression and bootstrap regression variants (see Section 2.1.4). These are two different methods to introduce between-imputation variability. The flat-file bootstrap functions re-sample individual observations; the two-level functions re-sample clusters. Note that this requires a substantial level-2 sample size. If this is not the case, it is advisable to use the Bayesian variant instead. For further information, see the Monte Carlo simulation in Section 4.10.

2 Theoretical Background

2.1 A brief introduction to missing data and multiple imputation

Missing data pose a threat to the validity of statistical inferences when they are (a) numerous, (b) not missing completely at random in the sense of Rubin (1976), and (c) handled in an inadequate way, for example by resorting to simple ad-hoc methods such as unconditional mean imputation (Schafer, 1997a). Multiple imputation (MI) is a state-of-the-art method to handle missing data (Schafer & Graham, 2002).

2.1.1 Multiple imputation in a nutshell

Since Rubin’s book (Rubin, 1987) – laying out the theoretical foundation – and Schafer’s software norm (MI based on the multivariate normal model, Schafer, 1997a), MI has become more and more popular and is now one of the standard procedures to handle missing data (Schafer & Graham, 2002). Today, different frameworks, methods, and models are available to create multiple imputations in many scenarios and contexts.

Multiple imputation procedures are generally preferable to single imputation, since single imputation methods typically do not produce unbiased standard error estimates, unless some correction procedure is applied.

Multiple imputation following Rubin’s theory (Rubin, 1987) is a three-step process: In the first step, each missing value is filled in \(m\) times based on an appropriate prediction model for the incomplete data given the observed data. In the second step, the \(m\) completed data sets are analysed separately by any complete-data technique. In the third step, the \(m\) sets of model parameters and standard errors are integrated into an overall set of results using formulas given in Rubin (1987) or Barnard & Rubin (1999). These combining rules integrate two variance components: variance within and variance between the imputed data sets. The between-imputation variance is supposed to reflect the additional estimation uncertainty due to missing data in an adequate way.
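The following minimal sketch illustrates these three steps with mice, using the nhanes example data that ships with the package (the analysis model chosen here is purely illustrative):

library(mice)
## step 1: impute the incomplete data m times (nhanes ships with mice)
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)
## step 2: analyse each completed data set with a complete-data method
fit <- with(imp, lm(chl ~ bmi + age))
## step 3: pool the m sets of estimates and standard errors via Rubin's rules
summary(pool(fit))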

Most MI methods today assume that the data are missing at random (MAR) in the sense of Rubin (1976), meaning that missingness can be predicted by observed information in the data set and does not depend on unobserved information. To make the MAR assumption more plausible in general, it is advisable to identify (and include in the imputation model) variables that are well related to both the incomplete variables and their missingness indicators (see for example Collins et al., 2001; van Buuren & Groothuis-Oudshoorn, 2011).

2.1.2 Multiple imputation frameworks

The two common frameworks for creating multiple imputations are joint modeling (JM, e.g. Schafer, 1997a, 1997b) and conditional modeling (CM), also known as sequential regression multiple imputation, fully conditional specification (FCS), or multiple imputation by chained equations (mice, van Buuren, 2012). In JM, a joint imputation model is specified for the data set as a whole, whereas in CM, separate conditional imputation models are specified for each incompletely observed variable in the data file. The mice software is based on the conditional modeling approach. The idea of CM is to use conditional distributions to model an underlying joint distribution. It was originally developed as an alternative “general purpose multivariate imputation procedure that can handle a relatively complex data structure where explicit full multivariate models cannot be easily formulated” (Raghunathan et al., 2001, p. 86), for example when the data set consists of variables of many different types (e.g. continuous, categorical, count).

2.1.3 The mice package

mice is an excellent R package to create multiple imputations (van Buuren & Groothuis-Oudshoorn, 2011). countimp extends the functionality of package mice by providing additional imputation functions for various kinds of parametric count models. mice has a couple of advantages we want to point out:

  • mice already comes with a great variety of imputation methods and models.
  • mice can call user-written imputation functions; its functionality can therefore be easily extended.
  • mice offers highly helpful tools to assess convergence of the imputation algorithm and to assess plausibility of the obtained imputations. For an overview, see van Buuren & Groothuis-Oudshoorn (2011), or van Buuren (2012).

Here, we focus only on the most important essentials to get novice users started. Imputations are created by the main mice() function, which has the following arguments:

library(mice)
args("mice")
function (data, m = 5, method = NULL, predictorMatrix, ignore = NULL, 
    where = NULL, blocks, visitSequence = NULL, formulas, blots = NULL, 
    post = NULL, defaultMethod = c("pmm", "logreg", "polyreg", 
        "polr"), maxit = 5, printFlag = TRUE, seed = NA, data.init = NULL, 
    ...) 
NULL

m sets the number of imputations (default is m = 5); maxit sets the maximum number of iterations of the Gibbs sampler.

Imputation methods are specified via argument method. A non-exhaustive overview of imputation methods that are already implemented in mice is given in Table 1.

Table 1: An overview of imputation methods in mice
Name Description Scale
pmm predictive mean matching numeric
norm Bayesian linear regression numeric
2l.norm 2-level linear mixed effects model numeric
logreg Bayesian logistic regression factor (2 levels)
polyreg polytomous regression factor (\(>2\) levels)
sample random sample from observed data any

The procedures are all described in detail in van Buuren & Groothuis-Oudshoorn (2011) and van Buuren (2012). Argument method must be a character vector of length equal to the number of variables in the data set, indicating the missing data procedure with which each variable is to be imputed. The first slot of method sets the imputation function for the first variable in the data file, and so on. If method is not specified, the defaultMethod is used: pmm (predictive mean matching) for continuous data, logreg (Bayesian logistic regression) for factors with two levels, and polyreg (polytomous logistic regression) for factors with more than two levels. Completely observed variables get method "", indicating that they need not be imputed. The respective imputation functions are called mice.impute.name, where name identifies the respective imputation function. For example, specifying "logreg" as imputation method for a variable calls function mice.impute.logreg().

library(mice)
mice.impute.logreg
function (y, ry, x, wy = NULL, ...) 
{
    if (is.null(wy)) 
        wy <- !ry
    aug <- augment(y, ry, x, wy)
    x <- aug$x
    y <- aug$y
    ry <- aug$ry
    wy <- aug$wy
    w <- aug$w
    x <- cbind(1, as.matrix(x))
    expr <- expression(glm.fit(x = x[ry, , drop = FALSE], y = y[ry], 
        family = quasibinomial(link = logit), weights = w[ry]))
    fit <- eval(expr)
    fit.sum <- summary.glm(fit)
    beta <- coef(fit)
    rv <- t(chol(sym(fit.sum$cov.unscaled)))
    beta.star <- beta + rv %*% rnorm(ncol(rv))
    p <- 1/(1 + exp(-(x[wy, , drop = FALSE] %*% beta.star)))
    vec <- (runif(nrow(p)) <= p)
    vec[vec] <- 1
    if (is.factor(y)) {
        vec <- factor(vec, c(0, 1), levels(y))
    }
    vec
}
<bytecode: 0x11ba2fc80>
<environment: namespace:mice>

For details about the respective arguments, see help("mice.impute.logreg").
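As a minimal, hedged illustration of the method argument, the following sketch uses the nhanes2 data shipped with mice; the particular method choices are ours and purely for demonstration:

library(mice)
data(nhanes2)                   ## example data shipped with mice
meth <- make.method(nhanes2)    ## "" for complete variables, defaults otherwise
meth["bmi"] <- "norm"           ## e.g. switch bmi to Bayesian linear regression
imp <- mice(nhanes2, method = meth, m = 5, maxit = 5, seed = 123, printFlag = FALSE)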

Imputation models are defined via the predictorMatrix argument. predictorMatrix must be a square matrix whose dimensions equal the number of variables in the data set. An example is given below:

library(countimp)
data(crim4w)
quickpred(crim4w)
       id FEMALE RE GY HA ACRIM BCRIM CCRIM DCRIM
id      0      0  0  0  0     0     0     0     0
FEMALE  0      0  0  0  0     0     0     0     0
RE      0      0  0  0  0     0     0     0     0
GY      0      0  0  0  0     0     0     0     0
HA      0      0  0  0  0     0     0     0     0
ACRIM   1      1  0  1  1     0     1     1     1
BCRIM   1      1  0  1  1     1     0     1     1
CCRIM   1      1  0  1  1     1     1     0     1
DCRIM   1      1  0  1  1     1     1     1     0

Each row \(i\) in that matrix gives the imputation model of variable \(i\) in the data file. The zeros and ones indicate (0 = no; 1 = yes) whether the respective variable in column \(j\) is used to predict missing data in variable \(i\). Using the information from the predictorMatrix, mice automatically creates three objects that are passed on to the respective mice.impute.name sub-function: y, x and ry. y is an incomplete data vector of length \(n\), the variable to be imputed. x is an \(n \times p\) matrix of predictors, i.e. those variables that were specified via the respective row in the predictorMatrix. ry is the response indicator of vector y, indicating whether a value in y has been observed (TRUE) or not (FALSE). For descriptions of further arguments, see help("mice").
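As a hedged sketch, a suggested predictor matrix can be modified before imputation, for instance to exclude the case identifier from all imputation models:

library(countimp)
library(mice)
data(crim4w)
pred <- quickpred(crim4w)    ## let quickpred suggest a predictor matrix
pred[, "id"] <- 0            ## never use the case identifier as a predictor
## pred can then be passed on, e.g.
## imp <- countimp(crim4w, predictorMatrix = pred, method = ..., print = FALSE)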

2.1.4 Bayesian regression and bootstrap regression

Regression based imputation methods can be classified as either deterministic or stochastic, depending on whether or not there is some degree of randomness in the imputation process. A deterministic imputation approach would impute the value that is predicted for the respective incomplete case. Such approaches usually cause an underestimation of standard errors, since all imputed values lie directly on the regression line.

Multiple imputation based on Rubin’s theory (Rubin, 1987) includes a stochastic component that introduces an adequate amount of between-imputation variability and thereby reflects the estimation uncertainty due to missing data.

Two common solutions in CM are Bayesian regression (see for example Rubin, 1987, pp. 169–170), and bootstrap regression: Bayesian regression first fits a model to the complete part of the data and computes the posterior mean \(\hat{\beta}\) and posterior variance \(V(\hat{\beta})\) of the model parameters \(\beta\). Then, new parameters are simulated from \(\mathcal{N}(\hat{\beta},V(\hat{\beta}))\). These new parameters are then used to make predictions for the incomplete part of the data and to obtain the imputations.

The bootstrap variant draws a bootstrap sample from the observed cases, fits the regression model to the bootstrap sample and uses the obtained model parameters to make predictions for the incomplete part of the data.

Imputation functions from package countimp are available as Bayesian regression and bootstrap regression variants.
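To make the distinction concrete, here is a minimal sketch (our own illustration, not countimp's internal code) of both approaches for a simple Poisson imputation model:

set.seed(1)
x  <- rnorm(200)
y  <- rpois(200, exp(1 + 0.5 * x))
ry <- runif(200) > 0.3                       ## TRUE = observed, FALSE = missing

## Bayesian regression: draw new coefficients from N(beta.hat, V(beta.hat))
fit       <- glm(y ~ x, family = poisson, subset = ry)
beta.star <- MASS::mvrnorm(1, coef(fit), vcov(fit))
imp.bayes <- rpois(sum(!ry), exp(cbind(1, x[!ry]) %*% beta.star))

## Bootstrap regression: refit the model on a bootstrap sample of the observed cases
idx      <- sample(which(ry), sum(ry), replace = TRUE)
fit.b    <- glm(y[idx] ~ x[idx], family = poisson)
imp.boot <- rpois(sum(!ry), exp(cbind(1, x[!ry]) %*% coef(fit.b)))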

2.2 A brief introduction to count data models

A good introduction and summary may be found in Zeileis et al. (2008). A very comprehensive text is Hilbe (2011). The standard procedure to analyze ordinary count data is to fit a Poisson model under the generalized linear modeling (GLM) framework. GLMs describe the dependence of a scalar variable \(y_i\) on a set of regressors \(x_i\). The conditional distribution of \(y_i\vert x_i\) is a linear exponential family with probability density

\[\begin{equation} f(y;\lambda,\delta) = exp\left( \frac{y\lambda-b(\lambda)}{\delta}+c(y,\delta) \right), \end{equation}\]

with \(\lambda\) depending on \(x_i\) via a linear predictor, \(\delta\) being a dispersion parameter, and \(b(\cdot)\) and \(c(\cdot)\) being functions that determine, which member of the family (e.g., Poisson) is used.

The classical Poisson model

\[\begin{equation} f(y;\mu)=\frac{exp(-\mu)\mu^y}{y!} \end{equation}\] with link function \(g(\mu)=log(\mu)\) assumes that the variance \(V(\mu)\) is equal to the mean \(\mu\). In Poisson regression, the probability of observing a certain count \(y_i\) given covariates \(x_i\) is

\[\begin{equation*} P(Y_i=y_i\vert x_i)=\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!}, \quad y_i=0,1,2,\dots. \end{equation*}\]

The conditional mean is \(E(y_i\vert x_i)=\mu_i=e^{x_i^\prime\beta}\); the variance is assumed to be equal to the mean.

A problem that often arises when analysing empirical data is that the equidispersion assumption – that the variance has to be equal to the mean – is violated. Very often empirical data are overdispersed, which means that the variance is larger in comparison to the mean. Analyzing overdispersed data using classical Poisson regression could lead to an underestimation of the variation in the data and an overestimation of statistical significance (cf. Zeileis et al., 2008). Models that allow for overdispersion are the quasi-Poisson or the negative binomial model.

The quasi-Poisson model is identical to the Poisson model, except that it estimates dispersion from the data. Another solution to address overdispersion in the data is to fit a negative binomial (NB) model – a gamma Poisson mixture model. There are a number of different parametrizations of the NB models (for details, see Hilbe, 2011). Note that the NB model and the Poisson model are nested and that the NB distribution converges to a Poisson distribution, when overdispersion \(\rightarrow 0\).
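As a hedged, minimal illustration (simulated data, purely for demonstration), overdispersion can be diagnosed via the Pearson dispersion of a Poisson fit and then addressed by a quasi-Poisson or NB fit:

set.seed(2)
x <- rnorm(500)
y <- MASS::rnegbin(500, mu = exp(1 + 0.5 * x), theta = 2)   ## overdispersed counts
fit.pois <- glm(y ~ x, family = poisson)
## Pearson dispersion: values clearly above 1 indicate overdispersion
sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)
fit.qp <- glm(y ~ x, family = quasipoisson)   ## estimates the dispersion from the data
fit.nb <- MASS::glm.nb(y ~ x)                 ## gamma-Poisson mixture (NB2)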

An excess of zero counts can be addressed by fitting a zero-inflated Poisson or negative binomial model (Lambert, 1992) or by fitting a hurdle model (Mullahy, 1986). countimp supports both options. Both model classes are mixture models and contain a second model component that addresses the excess zeros in the empirical data: Hurdle models combine a zero-truncated count component (a zero-truncated Poisson or NB model) and a hurdle component (usually a binomial GLM). Zero-inflation models combine a point mass at zero and the count component (a Poisson or NB model). The result of a Bernoulli trial determines which process is used.

Different sets of covariates can be specified in each model component – since in practice the zero process and the count process are often determined by different sets of predictors.

The main difference between the two model classes is that zero-inflation models allow two possible sources of zeros: zeros may stem from either the count part or the zero part of the model. In hurdle models, all zeros stem from the zero (hurdle) part, because the count part is zero-truncated.
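For illustration, both model classes can be fit with package pscl (which is also used later in this document to analyse completed data sets). The following hedged sketch uses pscl's bioChemists example data; the covariate choice is ours and purely illustrative:

library(pscl)
data("bioChemists", package = "pscl")
## count part: fem + ment; zero part: ment only (different covariate sets)
fit.zinb <- zeroinfl(art ~ fem + ment | ment, data = bioChemists, dist = "negbin")
fit.hnb  <- hurdle(art ~ fem + ment | ment, data = bioChemists, dist = "negbin")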

3 Why we should not resort to proxy solutions to impute heavily skewed count data

Previous missing data research discourages the use of imputation models that do not have a good fit to the distribution of the empirical data at hand (Kleinke, 2017; Yu et al., 2007). When data are zero-inflated counts (which are typically heavily skewed), an appropriate imputation model (and analysis model) would be a zero-inflation model or a hurdle model (Hilbe, 2011; Lambert, 1992; Mullahy, 1986). Apart from using adequate count data imputation models – which until recently were not available – proxy strategies to impute these kinds of data include:

  • to ignore the fact that count data are (often skewed) non-negative integer values, and to use standard procedures for example based on OLS regression,
  • to apply some normalizing transformation like a log or square root transformation, followed by normal model multiple imputation and backtransformation to the original scale afterwards, or
  • to treat the data as ordered categorical and use an ordinal logistic regression imputation model.

van Buuren (2012), Chapter 3.6.1 furthermore recommends the following strategies to impute count data:

  • predictive mean matching
  • ordered categorical regression
  • (zero-inflated) Poisson regression
  • (zero-inflated) negative binomial regression.

The transformation – imputation – backtransformation approach has been implemented in Schafer’s norm for Windows software from 1999 (which is unfortunately no longer available). Back then, norm was one of the very few user-friendly software solutions available. Transforming non-normal variables before the imputation to make the normality assumption of the imputation model more plausible was one of the few options practitioners had at that time to handle missing data in non-normal data. Today, missing data researchers (e.g. von Hippel, 2013) usually discourage the use of transformations to make the normality assumption of an imputation model more plausible. Usually, better alternatives are available. Regarding ordered categorical regression, we are currently not aware of any systematic evaluations.

In this section, we would like to discuss these recommendations from 2012 in the light of current missing data research, and want to corroborate our points by a Monte Carlo simulation. van Buuren (2012) writes that it is not “yet clear whether one of the four methods consistently outperforms the others. Until more detailed results become available, my advice is to use predictive mean matching for count data” (p. 78).

Firstly, we absolutely agree that more systematic simulation studies are needed before we can provide sound practical guidelines regarding which method is the best for a given scenario.

Furthermore, we also agree that predictive mean matching is a good allround method that can be recommended for many scenarios, including count data that are not too severely skewed. Kleinke (2017), for example, simulated incomplete Poisson distributed data and showed that PMM yielded acceptable results when skewness was only mild to moderate and when not too many data had to be imputed. However, inferences were biased when count data were more heavily skewed and the missing data percentage was substantial.

Since zero-inflated data are usually heavily skewed, we assume that PMM might not be an option for the imputation of zero-inflated count data.

Secondly, as already mentioned, we are not aware of systematic simulations that have tested how appropriate ordered categorical imputation is for zero-inflated Poisson or negative binomial data. Generally speaking, (ordered) categorical imputation could be feasible when the number of categories is not too large (otherwise, one might run into estimation problems / empty cell problems) (van Buuren & Groothuis-Oudshoorn, 2011). van Buuren & Groothuis-Oudshoorn (2011) mention that imputation of variables with more than 20 categories is often problematic. Therefore, if the range of the count variable lies between 0 and 20, using mice.impute.polr to create the imputations might be an option. In our simulation, we wanted to test whether this is an appropriate strategy for zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) data.

Thirdly, Kleinke & Reinecke (2013) have demonstrated that MI based on a ZIP or ZINB imputation model produced acceptable results when the data were in fact zero-inflated Poisson or zero-inflated negative binomial respectively. However, we did not compare ZIP and ZINB imputation against predictive mean matching or ordered categorical imputation.

We therefore want

  1. to explore whether PMM can be safely used for the imputation of ZIP and ZINB data,
  2. to evaluate the appropriateness of ordered categorical imputation in these scenarios, and
  3. to compare the results of these strategies against results based on ZIP and ZINB imputation models.

To these ends, we re-analyse the data from Kleinke & Reinecke (2013).

3.1 Method

Details about the simulation set-up are given in Kleinke & Reinecke (2013). Simulated data sets contained four variables. \(y\) was the dependent incomplete zero-inflated count variable, which – depending on the respective condition – was either ZIP or ZINB. Data sets also included three continuous predictors: \(x_1\), \(x_2\), and \(z\), with \(x_1\) and \(x_2\) being the covariates in the count part of the model, and \(z\) the covariate in the zero-inflation part. Population parameters were set to \(\beta_0=1\), \(\beta_1=.3\), \(\beta_2=.3\), \(\gamma_0=0\), \(\gamma_1=2\), with \(\beta\) being the parameters in the count part and \(\gamma\) the parameters in the zero part of the model. In the ZINB conditions, the dispersion parameter was set to 1. For both the ZIP and ZINB conditions, 1000 data sets with sample sizes \(N\in\{500,1000,10000\}\) were simulated respectively. MAR missingness in \(y\) was introduced as outlined in Kleinke & Reinecke (2013). Missing data were imputed using functions mice.impute.pmm (predictive mean matching) and mice.impute.polr (ordered polytomous regression). Repeated data analysis and pooling of results was done as described in Kleinke & Reinecke (2013).

3.2 Results

PMM results are given in Table 2 and Table 3. The tables display the average estimate of the respective parameter \(\widehat{Q}\) across the 1000 replications, its standard deviation, bias, which is defined as the difference between the simulated true parameter and its average estimate across the 1000 replications, and 95% coverage rates, the percentage of intervals that include the true parameter. Obviously, \(\widehat{Q}\) should be close to the true parameter \(Q\). Furthermore, the standard deviation of the estimates across the replications should be small (which in combination with an accurate \(\widehat{Q}\) reflects a consistently good estimation across the replications). Finally, 95% coverage should be close to 95%. Schafer & Graham (2002) deem values below 90% as serious undercoverage.
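A hedged sketch of how these criteria can be computed from replication results (function and argument names are ours, for illustration only):

## est = pooled estimates, lo/hi = 95% CI bounds; one entry per replication
mc.summary <- function(est, lo, hi, true) {
  c(Q.hat    = mean(est),                          ## average estimate
    SD.Q     = sd(est),                            ## variability across replications
    bias     = true - mean(est),                   ## true parameter minus average estimate
    coverage = 100 * mean(lo <= true & hi >= true) ## % of CIs covering the true value
  )
}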

Table 2: Performance of predictive mean matching MI when data are ZIP
N Parameter \(\widehat{Q}\) \(SD_{\widehat{Q}}\) Bias Coverage Rate
500 \(\beta_0\) 0.944 0.062 0.056 87.4
500 \(\beta_1\) 0.240 0.052 0.060 79.2
500 \(\beta_2\) 0.249 0.051 0.051 86.7
500 \(\gamma_0\) -0.183 0.172 0.183 86.8
500 \(\gamma_1\) 1.719 0.235 0.281 79.1
1000 \(\beta_0\) 0.947 0.042 0.053 80.9
1000 \(\beta_1\) 0.243 0.037 0.057 67.2
1000 \(\beta_2\) 0.252 0.036 0.048 77.8
1000 \(\gamma_0\) -0.185 0.121 0.185 74.6
1000 \(\gamma_1\) 1.707 0.165 0.293 63.3
10000 \(\beta_0\) 0.946 0.014 0.054 2.8
10000 \(\beta_1\) 0.244 0.011 0.056 0.7
10000 \(\beta_2\) 0.250 0.011 0.050 2.3
10000 \(\gamma_0\) -0.191 0.038 0.191 0.3
10000 \(\gamma_1\) 1.699 0.052 0.302 0.2
Table 3: Performance of predictive mean matching MI when data are ZINB
N Parameter \(\widehat{Q}\) \(SD_{\widehat{Q}}\) Bias Coverage Rate
500 \(\beta_0\) 0.938 0.121 0.062 92.6
500 \(\beta_1\) 0.265 0.096 0.035 93.8
500 \(\beta_2\) 0.271 0.099 0.029 93.9
500 \(\gamma_0\) -0.195 0.300 0.195 96.0
500 \(\gamma_1\) 1.729 0.326 0.271 84.9
1000 \(\beta_0\) 0.939 0.085 0.061 91.5
1000 \(\beta_1\) 0.266 0.068 0.034 92.2
1000 \(\beta_2\) 0.270 0.067 0.030 95.2
1000 \(\gamma_0\) -0.194 0.211 0.194 92.0
1000 \(\gamma_1\) 1.709 0.220 0.291 77.7
10000 \(\beta_0\) 0.939 0.026 0.061 42.1
10000 \(\beta_1\) 0.269 0.022 0.031 69.8
10000 \(\beta_2\) 0.271 0.021 0.029 75.8
10000 \(\gamma_0\) -0.196 0.065 0.196 20.6
10000 \(\gamma_1\) 1.694 0.066 0.306 3.0

When we first look at results of the ZIP conditions, we see that estimates \(\widehat{Q}\) are very similar – regardless of sample size. However, the parameters in both the zero model and the count model are underestimated.

Furthermore, note that coverage rates decrease dramatically with increasing sample size. This is because standard error estimates depend on sample size. An increasing sample size leads to a smaller standard error estimate. Confidence intervals therefore become narrower, and even smaller biases can result in significant undercoverage (i.e. fewer intervals include the true parameter).

Let us turn to the ZINB conditions: Here, biases are especially noticeable in the zero part of the model. Though parameters of the count part of the model are also underestimated (especially \(\beta_1\)), corresponding coverage rates are acceptably large when \(N=500\) or \(N=1000\). In the large sample size condition, however, coverage falls below 90% for all parameters, and, in the worst case, drops to 3% for parameter \(\gamma_1\).

Results of polytomous regression (function mice.impute.polr) are given in Table 4 and Table 5. Coefficients of both the zero and the count part of the model are usually underestimated and coverage rates of most model parameters are usually unacceptably low regardless of sample size.

We conclude that – in this scenario – where data were zero-inflated Poisson or negative binomial and severely skewed, both predictive mean matching and ordered polytomous regression produced inadequate results and biased statistical inferences. When we compare results to the ones reported in Kleinke & Reinecke (2013), we see that using an imputation model with fitting distributional assumptions is highly recommendable.

Table 4: Performance of polytomous regression MI when data are ZIP
N Parameter \(\widehat{Q}\) \(SD_{\widehat{Q}}\) Bias Coverage Rate
500 \(\beta_0\) 0.997 0.055 0.003 95.0
500 \(\beta_1\) 0.284 0.046 0.016 94.6
500 \(\beta_2\) 0.276 0.046 0.024 93.3
500 \(\gamma_0\) -0.024 0.162 0.024 94.4
500 \(\gamma_1\) 1.943 0.252 0.057 93.9
1000 \(\beta_0\) 0.997 0.037 0.003 95.0
1000 \(\beta_1\) 0.284 0.034 0.016 92.5
1000 \(\beta_2\) 0.281 0.032 0.019 92.6
1000 \(\gamma_0\) -0.016 0.115 0.016 94.4
1000 \(\gamma_1\) 1.932 0.175 0.068 91.5
10000 \(\beta_0\) 0.996 0.013 0.004 92.4
10000 \(\beta_1\) 0.283 0.010 0.017 63.3
10000 \(\beta_2\) 0.284 0.011 0.016 70.0
10000 \(\gamma_0\) -0.004 0.036 0.004 94.4
10000 \(\gamma_1\) 1.921 0.053 0.079 70.7
Table 5: Performance of polytomous regression MI when data are ZINB
N Parameter \(\widehat{Q}\) \(SD_{\widehat{Q}}\) Bias Coverage Rate
500 \(\beta_0\) 1.038 0.116 -0.038 90.5
500 \(\beta_1\) 0.299 0.093 0.001 94.7
500 \(\beta_2\) 0.276 0.092 0.024 93.8
500 \(\gamma_0\) 0.079 0.261 -0.079 90.5
500 \(\gamma_1\) 1.781 0.289 0.219 85.8
1000 \(\beta_0\) 1.030 0.083 -0.030 91.6
1000 \(\beta_1\) 0.296 0.066 0.004 94.2
1000 \(\beta_2\) 0.281 0.064 0.019 93.6
1000 \(\gamma_0\) 0.080 0.184 -0.080 91.3
1000 \(\gamma_1\) 1.773 0.201 0.227 82.4
10000 \(\beta_0\) 1.022 0.026 -0.022 85.8
10000 \(\beta_1\) 0.292 0.021 0.008 92.3
10000 \(\beta_2\) 0.291 0.021 0.009 93.1
10000 \(\gamma_0\) 0.077 0.056 -0.077 71.6
10000 \(\gamma_1\) 1.782 0.062 0.218 16.4

3.3 Summary and discussion

Kleinke & Reinecke (2013) used an imputation model with fitting distributional assumptions (i.e. ZIP imputation for ZIP data and ZINB imputation for ZINB data) and obtained unbiased statistical inferences. Here, we used the same data and imputed them using proxy methods like predictive mean matching or (ordered) polytomous regression. These strategies, however, yielded biased statistical inferences.

Results of this simulation once again stress the need to find an appropriate imputation model that has a sufficiently good fit to the data at hand. If the imputation model (like the normal linear regression model used by mice.impute.pmm) is overly implausible, results will be biased.

Limitations and caveats

This study was based on simulated data, which were ZIP and ZINB distributed. Note that a ZIP or ZINB model will never have a perfect fit to real empirical data. Empirical researchers need to bear in mind that the quality of imputations will depend on how well one’s empirical data can be modeled by mathematically convenient models (like in this case a ZIP or ZINB model). Usually, the worse the model fit between empirical data and the assumed imputation model is, the less plausible imputations will be and the more bias is to be expected.

4 countimp

4.1 A cautionary note

Package countimp creates imputations based on parametric count models. Plausibility of the obtained imputations will depend upon the correct specification of the imputation procedure, i.e. the count data regression model that is used to predict missing information must fit the empirical data well.

Research by Kleinke & Reinecke (2013) suggests that misspecifications of the (count) imputation model can bias statistical inferences quite noticeably. We therefore advise researchers to spend sufficient time on modelling. We would like to refer readers to the excellent book by Hilbe (2011) to determine which count model fits the data best.

We would also like to draw attention to the R packages Qtools and ImputeRobust, which can create imputations based on nonparametric quantile regression and semi-parametric generalized additive models; these might be suitable alternatives in situations where the assumptions of parametric count models are violated.

4.2 Count data models

An overview of parametric imputation models available in package countimp is given in Table 6. More details on the theoretical background of count regression models are given in Zeileis et al. (2008) and in Hilbe (2011).

Table 6: An overview of imputation methods and models in package countimp
Function Name model / family
mice.impute.poisson() Poisson
mice.impute.quasipoisson() Quasi-Poisson
mice.impute.nb() negative binomial
mice.impute.zip() zero-inflated Poisson
mice.impute.zinb() zero-inflated negative binomial
mice.impute.hp() hurdle Poisson
mice.impute.hnb() hurdle negative binomial
mice.impute.2l.poisson() two-level Poisson
mice.impute.2l.nb() two-level negative binomial
mice.impute.2l.zip() two-level zero-inflated Poisson
mice.impute.2l.zinb() two-level zero-inflated NB
mice.impute.2l.hp() two-level hurdle Poisson
mice.impute.2l.hnb() two-level hurdle negative binomial

4.3 Installation

The latest version can be installed from GitHub using function install_github() from package remotes:

remotes::install_github('kkleinke/countimp')

4.4 Imputation of incomplete Poisson distributed variables

The function to impute an incomplete Poisson distributed variable is called mice.impute.poisson(). Appending .boot to the function name calls the bootstrap regression variant (see help("mice.impute.poisson.boot")). The functions have the following arguments:

library(countimp)
args("mice.impute.poisson")
function (y, ry, x, wy = NULL, EV = TRUE, ...) 
NULL

Arguments y, ry, x need not be specified. They are automatically obtained from mice. For information on how to use mice’s new wy-argument, and for further information, see help("mice").

Numeric vector y is the incomplete Poisson distributed variable that shall be imputed. ry is the response indicator of y, with TRUE indicating an observed value and FALSE a missing observation in vector y respectively. wy is a logical vector of length(y). TRUE statements indicate locations in y for which imputations shall be generated (default is !ry). x is a matrix with length(y) rows containing complete covariates – the variables in the imputation model, which has been specified via mice’s predictorMatrix argument. The variables in x are used to predict missing information in y. Argument EV defines how the function handles extreme imputations. When set to TRUE, the function automatically detects extreme values among the imputations and replaces them by values obtained by predictive mean matching.

We use the standard glm.fit function from package stats to fit the Poisson model to the observed part of the data, i.e. x[ry,], y[ry]. Imputations are drawn from a Poisson distribution with mean \(\lambda\), which is the value predicted for the respective incomplete case.

4.4.1 Example

We generate an example data set with a Poisson distributed dependent variable \(y\) and three continuous predictors \(x1\)–\(x3\) and introduce MAR missingness in the dependent variable.

set.seed( 1234 )
b0 <- 1
b1 <- .75
b2 <- -.25
b3 <- .5
N <- 5000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
lam <- exp( b0 + b1 * x1 + b2 * x2 + b3 * x3 )
y <- rpois( N, lam )
POIS <- data.frame( y, x1, x2, x3 )

## introduce MAR missingness to simulated data
generate.md <- function( data, pos = 1, Z = 2, pmis = .5, strength = c( .5, .5 ) ) 
{
 ## pos: column that receives the missing values; Z: column that drives missingness
 ## pmis: overall proportion of missing data
 ## strength: shares of missing values taken from cases below vs. above the mean of Z
 total <- round( pmis * nrow(data) )           ## total number of values to delete
 sm <- which( data[,Z] < mean( data[,Z] ) )    ## cases below the mean of Z
 gr <- which( data[,Z] > mean( data[,Z] ) )    ## cases above the mean of Z
 sel.sm <- sample( sm, round( strength[1] * total ) )
 sel.gr <- sample( gr, round( strength[2] * total ) )
 sel <- c( sel.sm, sel.gr )
 data[sel,pos] <- NA                           ## delete the selected values
 return(data)
}
MPOIS <- generate.md( POIS, pmis = .2, strength = c( .2, .8) )

This procedure generates 20% missing data in y. Missingness depends on variable x1. Cases whose x1-value is above the mean of x1 are more likely to have a missing value in y. The resulting incomplete data set is called MPOIS.

Imputations are then generated using wrapper function countimp(), which passes the arguments on to mice.

To select function mice.impute.poisson() to impute missing data in \(y\), the respective slot in method has to be set to "poisson". See help("mice") for details.

colnames(MPOIS)
[1] "y"  "x1" "x2" "x3"

MPOIS has four variables. Thus, method has to be a vector of length 4. y is the first variable in MPOIS. We therefore need to set the first entry in method to "poisson" (or "poisson.boot" for the bootstrap variant). The other variables are completely observed and get method "".

imp <- countimp( MPOIS, method = c( "poisson" ,"" ,"" ,"" ), print=FALSE)
impb <- countimp( MPOIS, method = c( "poisson.boot" ,"" ,"" ,"" ), print=FALSE)

Note that if predictorMatrix is not specified, all other variables are used to predict missing information in y.

The resulting object imp is of class mids (a multiply imputed data set). Standard mice functions like with() and pool() can be applied to that object to automate repeated data analysis and pooling of results.

repeated.analysis <- with(imp, glm(y~x1+x2+x3,family = poisson))
summary(pool(repeated.analysis))
         term   estimate   std.error statistic        df       p.value
1 (Intercept)  0.9747960 0.010307131  94.57491 204.47467 9.656761e-171
2          x1  0.7549316 0.008231517  91.71233  78.34325  1.761782e-81
3          x2 -0.2349946 0.007524757 -31.22953 454.07727 3.870181e-115
4          x3  0.4947964 0.008345951  59.28580  46.64120  1.440859e-45

For details about the respective slots of the output, see van Buuren & Groothuis-Oudshoorn (2011).

4.5 Imputation of overdispersed count data

countimp has two methods to impute overdispersed count data:

  1. imputation based on a quasi-Poisson model, and
  2. imputation based on a negative binomial model.

Both methods can be used when the equidispersion assumption of the standard Poisson model is violated, i.e. when the variance of the count variable is larger in comparison to its mean.

4.5.1 Imputation based on a quasi-Poisson model

Method "quasipoisson" imputes univariate missing data based on a Bayesian quasi-Poisson GLM using the standard R function glm.fit (i.e. using family = quasipoisson(link = log)). Imputations are simulated from a negative binomial distribution to ensure an adequate dispersion of imputed values: The quasi-Poisson model estimates Pearson dispersion \(\delta\) that gives the amount of overdispersion in the data. Pearson dispersion \(\delta\) is stored in slot $dispersion of the output of the summary.glm function. We supply this parameter to the rnbinom() function from package stats to obtain the imputations. For details about the parametrization of the NB distribution used by rnbinom, see help("rnbinom").

4.5.2 Example

We simulate an overdispersed count variable y and generate about 20% missing data in this variable. Imputation, repeated data analysis, and pooling of results work analogously to Section 4.4.1.

## simulate overdispersed count data
set.seed( 1234 )
b0 <- 1
b1 <- .75
b2 <- -.25
b3 <- .5
N <- 5000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
mu <- exp( b0 + b1 * x1 + b2 * x2 + b3 * x3 )
y <- MASS::rnegbin( N, theta = 2, mu )
NB <- data.frame( y, x1, x2, x3 )

## introduce MAR missingness to simulated data
total <- round( .2 * N )  ##number of missing data in y
sm <- which( NB[,2] < mean( NB[,2] ) )  ##subset: cases with x1<mean(x1)
gr <- which( NB[,2] > mean( NB[,2] ) )  ##subset: cases with x1>mean(x1)
sel.sm <- sample( sm, round( .2 * total ) ) ##select cases to set as missing
sel.gr <- sample( gr, round( .8 * total ) ) ##select cases to set as missing
sel <- c( sel.sm,sel.gr )
MNB <- NB
MNB[sel,1] <- NA    ##delete selected data

## impute missing data
imp <- countimp( MNB, method = c( "quasipoisson", "", "", "" ), print = FALSE)

## repeated data analysis and pooling of results
repeated.analysis <- with(imp, glm(y~x1+x2+x3,family=quasipoisson))
summary(pool(repeated.analysis))
         term   estimate  std.error statistic        df      p.value
1 (Intercept)  1.0153314 0.01737996  58.41966 134.06310 3.093930e-97
2          x1  0.7217358 0.01571253  45.93377  25.94808 2.196939e-26
3          x2 -0.2352151 0.01839210 -12.78892  12.52307 1.478936e-08
4          x3  0.4569815 0.01463833  31.21814  34.53985 6.693467e-27

4.6 Imputation based on a negative binomial model

Method "nb" fits a negative binomial model using function glm.nb() from package MASS and simulates the imputations from a negative binomial distribution, using function rnegbin() from package MASS.

The imputation process works analogously to Section 4.4.1.

4.7 Imputation of count data with an excess number of zero counts

Methods zip and zinb impute missing data based on a zero-inflated Poisson or NB model respectively (for details, see Kleinke & Reinecke, 2013). Methods hp and hnb use a hurdle Poisson or NB model.

Users can specify different sets of predictors for the zero and count parts of the models. Allowed entries in the predictorMatrix are listed in Table 7.

Table 7: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in either model
1 variable is included in both the zero and the count model
2 count model predictor
3 zero model predictor

Note that the current mice version does not allow entries in the predictorMatrix other than 0 and 1, unless the model is a multilevel model. Two-level imputation methods in mice have the prefix 2l. Kleinke & Reinecke (2013) therefore used this workaround and labeled the functions 2l.zip and 2l.zinb – which were included in countimp version 1 – to allow the specification of different sets of predictors for the zero and count parts of the model via the predictorMatrix. This, however, led to some confusion among empirical researchers, since these functions did not fit two-level models.

Caution: In version 2 of package countimp, methods 2l.zip and 2l.zinb in fact fit two-level models – see Section 4.10.

In version 2 of package countimp, we have found a better solution. Wrapper function countimp() takes care of the problem. It is therefore important that users call countimp() rather than mice() to create the imputations when the imputation model is either a zero-inflation model or a hurdle model. countimp() takes the same arguments as mice() and passes them on to mice. Imputations are then created by mice.
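A hedged, self-contained sketch (simulated data; our own example, not taken from the original text) of what such a call might look like, using code 2 for count-model predictors and code 3 for zero-model predictors:

library(countimp)
set.seed(4)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n); z <- rnorm(n)
y  <- rpois(n, exp(1 + .3 * x1 + .3 * x2))
y[rbinom(n, 1, plogis(-2 * z)) == 1] <- 0        ## excess zeros driven by z
dat <- data.frame(y, x1, x2, z)
dat$y[sample(n, 200)] <- NA                      ## 20% missing data (MCAR, for brevity)
pred <- matrix(1, 4, 4, dimnames = list(names(dat), names(dat)))
diag(pred) <- 0
pred["y", ] <- c(0, 2, 2, 3)                     ## x1, x2: count model; z: zero model
imp <- countimp(dat, method = c("zip", "", "", ""),
                predictorMatrix = pred, print = FALSE)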

4.7.1 Monte Carlo Simulation

Kleinke & Reinecke (2013) have tested the zero-inflation functions (methods "2l.zip" and "2l.zinb") in various scenarios and found that the methods performed adequately well when the empirical data were in fact zero-inflated Poisson or negative binomial. Here, we use a highly similar set-up to evaluate the new hurdle model variants of these functions, which are available in version 2 of the countimp package.

Note that unlike zero-inflation models, hurdle models allow only one source of zeros. The zero model determines whether case \(i\) has a zero or non-zero observation, and the count model determines which non-zero value case \(i\) has. Hurdle models assume that the residuals of the hurdle component (here the zero model) and the outcome model (here the count model) are uncorrelated. This assumption is plausible if quite distinct groups form the zero versus the non-zero category.

An example where a hurdle model might be an appropriate analysis model is the number of cigarettes smoked per day. Non-smokers will quite certainly have a zero count, while smokers will vary in their number of cigarettes smoked per day. Furthermore, one set of variables might predict whether persons are smokers or non-smokers, while another set predicts how many cigarettes a smoker consumes per day.

To test the new hurdle model imputation functions, we created 1000 data sets with sample sizes \(N\in\{500,1000,10000\}\) respectively that contained four variables: \(y\) the dependent count variable that has an excess number of zeros, and continuous predictors \(x1\), \(x2\), and \(x3\).

\(x_1\)–\(x_3\) were drawn from normal \(\mathcal{N}(0,1)\) populations. \(\mu_{0i}\) was the individual probability of belonging to the count versus the zero-group and was determined by a logit model

\[\begin{equation} \mu_{0i}=\text{invlogit}(2\cdot x_3) \end{equation}\]

The respective observation in \(y\) was drawn from a binomial distribution with size 1 and probability \(\mu_{0i}\). Finally, a “1” in \(y\) was replaced by a draw from a zero-truncated Poisson distribution with mean \(\exp(1+.3\cdot x_{1i}+.3\cdot x_{2i})\) in the Poisson conditions or from a zero-truncated negative binomial distribution with mean \(\exp(1+.3\cdot x_{1i}+.3\cdot x_{2i})\) and overdispersion parameter \(\theta=1\) in the negative binomial conditions. MAR missingness was then introduced as outlined in Kleinke & Reinecke (2013). Missing data were imputed \(m=5\) times using both the Bayesian regression and the bootstrap regression variants of our hurdle model imputation functions. Function hurdle from package pscl was used to fit the corresponding hurdle model to the completed data sets. Results of the Poisson conditions are displayed in Table 8 and Table 9. \(\beta\) are the coefficients in the count part of the model, \(\gamma\) the coefficients in the zero model. \(Q\) are the simulated true population quantities, \(\widehat{Q}\) the Monte Carlo estimates, and \(SD_{\widehat{Q}}\) is the standard deviation of the estimates across the 1000 replications. We also report bias and 95% coverage rates. As can be seen in Table 8 and Table 9, estimates are always very close to the true population quantities, bias is very small, and coverage is adequate for all parameters regardless of sample size.

Table 8: Performance of MI based on a Bayesian regression hurdle Poisson model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.996 0.295 0.294 -0.005 2.028
500 \(SD_{\widehat{Q}}\) 0.052 0.049 0.050 0.145 0.228
500 Bias 0.004 0.005 0.006 0.005 -0.028
500 Coverage Rate 96.700 95.300 95.800 95.300 94.900
1000 \(\widehat{Q}\) 0.997 0.297 0.296 -0.003 2.014
1000 \(SD_{\widehat{Q}}\) 0.039 0.034 0.036 0.101 0.163
1000 Bias 0.003 0.003 0.004 0.003 -0.014
1000 Coverage Rate 94.600 95.200 95.200 95.800 95.400
10000 \(\widehat{Q}\) 1.000 0.299 0.300 -0.001 2.004
10000 \(SD_{\widehat{Q}}\) 0.012 0.011 0.011 0.032 0.048
10000 Bias 0.000 0.001 0.000 0.001 -0.004
10000 Coverage Rate 94.400 94.000 96.100 95.200 94.900
Table 9: Performance of MI based on a bootstrap regression hurdle Poisson model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.994 0.293 0.293 -0.007 2.037
500 \(SD_{\widehat{Q}}\) 0.052 0.049 0.049 0.147 0.234
500 Bias 0.006 0.007 0.007 0.007 -0.037
500 Coverage Rate 95.400 95.300 96.200 95.100 95.000
1000 \(\widehat{Q}\) 0.996 0.295 0.295 -0.005 2.017
1000 \(SD_{\widehat{Q}}\) 0.039 0.034 0.035 0.101 0.164
1000 Bias 0.004 0.005 0.005 0.005 -0.017
1000 Coverage Rate 94.600 95.200 95.100 95.500 93.800
10000 \(\widehat{Q}\) 1.000 0.299 0.300 -0.001 2.004
10000 \(SD_{\widehat{Q}}\) 0.012 0.011 0.011 0.032 0.049
10000 Bias 0.000 0.001 0.000 0.001 -0.004
10000 Coverage Rate 93.900 94.100 94.400 94.200 95.300
Table 10: Performance of MI based on a Bayesian regression hurdle NB model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.990 0.284 0.290 0.002 2.025
500 \(SD_{\widehat{Q}}\) 0.132 0.095 0.094 0.142 0.230
500 Bias 0.010 0.016 0.010 -0.002 -0.025
500 Coverage Rate 94.900 93.500 94.400 95.200 95.100
1000 \(\widehat{Q}\) 0.992 0.289 0.290 -0.004 2.019
1000 \(SD_{\widehat{Q}}\) 0.098 0.065 0.064 0.100 0.162
1000 Bias 0.008 0.011 0.010 0.004 -0.019
1000 Coverage Rate 94.400 94.400 93.800 95.900 96.400
10000 \(\widehat{Q}\) 1.000 0.298 0.299 0.001 2.003
10000 \(SD_{\widehat{Q}}\) 0.030 0.021 0.021 0.032 0.050
10000 Bias 0.000 0.002 0.001 -0.001 -0.003
10000 Coverage Rate 94.200 95.000 95.500 93.900 94.700
Table 11: Performance of MI based on a bootstrap regression hurdle NB model
N Parameter \(\beta_0\) \(\beta_1\) \(\beta_2\) \(\gamma_0\) \(\gamma_1\)
500 \(\widehat{Q}\) 0.992 0.280 0.288 0.002 2.039
500 \(SD_{\widehat{Q}}\) 0.134 0.094 0.093 0.145 0.230
500 Bias 0.008 0.020 0.012 -0.002 -0.039
500 Coverage Rate 94.600 92.900 92.800 94.800 94.600
1000 \(\widehat{Q}\) 0.991 0.285 0.289 -0.003 2.024
1000 \(SD_{\widehat{Q}}\) 0.096 0.065 0.066 0.101 0.164
1000 Bias 0.009 0.015 0.011 0.003 -0.024
1000 Coverage Rate 95.400 94.300 94.200 95.800 94.400
10000 \(\widehat{Q}\) 1.000 0.297 0.298 0.001 2.004
10000 \(SD_{\widehat{Q}}\) 0.030 0.021 0.020 0.032 0.050
10000 Bias 0.000 0.003 0.002 -0.001 -0.004
10000 Coverage Rate 93.600 93.800 95.300 95.300 94.600

Results of the NB conditions are displayed in Table 10 and Table 11. Again, parameter estimates are typically accurate and coverage is fine for all parameters. In the small sample size condition with \(N=500\), overdispersion parameter \(\theta\) was slightly overestimated (1.149 for the Bayesian regression method and 1.187 for the bootstrap regression method respectively).

With \(N=1000\), overdispersion parameter \(\theta\) was 1.082 for the Bayesian regression method, and 1.102 for the bootstrap regression method. When \(N=10000\), overdispersion parameter \(\theta\) was 1.010 for the Bayesian regression method, and 1.013 for the bootstrap regression method. In comparison, the mean estimate for \(\theta\) of the complete data sets – i.e. before any data in \(y\) were deleted – was 1.056.

All in all, results suggest that the imputation methods work sufficiently well.

4.8 Imputation of two-level Poisson distributed data

mice.impute.2l.poisson() imputes missing data based on a generalized linear mixed effects Poisson model. Note that the function requires the data to be in long format, i.e. the repeated measurements are stacked upon one another. Additional time and id variables indicate measurement timepoint (when the data are panel data) and cluster membership. The function is described in detail in Kleinke & Reinecke (2015a). To specify the imputation model, we use the same coding scheme as other two-level imputation functions in mice. Allowed entries in the predictorMatrix are given in Table 12:

Table 12: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in the imputation model
1 variable is included as a fixed effect
2 variable is included as a fixed and random effect
-2 class variable.

The intercept is always treated as a fixed and random factor. If this is not desired, the .noint variant of the function can be used, which treats the intercept only as a fixed, but not as a random effect (see help("mice.impute.2l.poisson")).
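A hedged, self-contained sketch of such a call (simulated long-format data; variable names and settings are ours, purely for illustration):

library(countimp)
set.seed(6)
longdata <- do.call(rbind, lapply(1:30, function(id) {
  u <- rnorm(1, 0, .5)                           ## random intercept for cluster id
  x <- rnorm(20)
  data.frame(id = id, x = x, y = rpois(20, exp(1 + .5 * x + u)))
}))
longdata$y[sample(nrow(longdata), 150)] <- NA    ## introduce missing values in y
pred <- matrix(0, 3, 3, dimnames = list(names(longdata), names(longdata)))
pred["y", ] <- c(-2, 2, 0)                       ## id: class variable; x: fixed + random effect
imp <- countimp(longdata, method = c("", "", "2l.poisson"),
                predictorMatrix = pred, print = FALSE)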

4.9 Imputation of overdispersed two-level count data

Method 2l.nb imputes incomplete data based on a two-level NB model (for details, see Kleinke & Reinecke, 2015b). In version 2 of package countimp, we have replaced package glmmADMB by glmmTMB to estimate the two-level NB model. In our practical experience, estimation by glmmTMB is faster and more stable. Note that the function requires data in long format and uses the same coding scheme as other two-level imputation functions in mice (cf. van Buuren & Groothuis-Oudshoorn, 2011). The specification of the imputation method and model works analogously to other multilevel imputation functions in mice, see for example help("mice.impute.2l.pan"). Allowed entries in the predictorMatrix are listed in Table 13.

Table 13: Allowed entries in the predictorMatrix
numeric code meaning
0 variable is not included in the imputation model
1 variable is included as a fixed effect
2 variable is included as a fixed and random effect
-2 class variable.

4.9.1 Monte Carlo Simulation

We tested the refurbished version of mice.impute.2l.nb by means of Monte Carlo simulation (with 200 replications). The design was very similar to the one used by Kleinke & Reinecke (2015b).

We simulated data of \(g=50\) groups of varying group sizes \(n_j\), \(j\in (1,2,\dots,g)\).

\(n_j\) were drawn randomly from all integer numbers between 100 and 400. The data of the 50 groups were generated separately and then stacked upon each other to create data sets in long format. Each data set consisted of one continuous individual level predictor \(x\), one continuous group level predictor \(z\), and the overdispersed dependent count variable \(y\). The intercept and the slope of \(x\) were treated as random factors (i.e. they were allowed to vary across the 50 groups). We set the following population parameters: \(\beta_0=1\), the fixed intercept term, \(\beta_1=.75\), the coefficient of the individual level predictor \(x\), \(\beta_2=.5\), the coefficient of the group level predictor \(z\), and \(\theta=2\), the overdispersion parameter. Random effects standard deviations were \(.5\) and \(.3\) for the intercept term (\(\tau_{00}\)) and the slope of \(x\) (\(\tau_{11}\)) respectively. The random effects correlation (\(\tau_{01}\)) was set to \(0\). \(x_j\), \(z_j\), and the individual level residuals \(e_j\) were simulated from \(\mathcal{N}(0,1)\) normal populations. Random effects were also drawn from normal \(\mathcal{N}(0,\tau)\) populations. Data were generated as follows.

We first obtained parameters

\[\begin{equation*} {\beta_{j}}={\Gamma}{Z_{j}}+{u_{j}} \end{equation*}\]

where

\[\begin{equation*} {\beta_{j}}=\begin{bmatrix} {\beta_{0j}} \\ {\beta_{1j}} \end{bmatrix}, {\Gamma} = \begin{bmatrix} \gamma_{00} & \gamma_{01} \\ \gamma_{10} & \gamma_{11} \\ \end{bmatrix} = \begin{bmatrix} 1 & .5 \\ 0.75 & 0 \\ \end{bmatrix}, {Z_j}=\begin{bmatrix} 1 \\ z_{j} \end{bmatrix}, \text{and } {u_{j}}=\begin{bmatrix} {u_{0j}} \\ {u_{1j}} \end{bmatrix}. \end{equation*}\]

For each individual \(i\), the dependent count variable \(y_{ij}\) was then drawn from \(y_{ij}\sim\text{NB}(\mu_{ij}=\exp(X_{ij}\beta_j),\theta=2)\), using the rnegbin() function from the R package MASS, where \(X_{ij}=\begin{bmatrix}1&x_{ij}\end{bmatrix}\).
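The following sketch illustrates this data-generating scheme (it is not the original simulation code; the seed and object names are arbitrary):

```r
library(MASS)   # provides rnegbin()

set.seed(1234)                                   # illustrative seed
g   <- 50
nj  <- sample(100:400, g, replace = TRUE)        # group sizes
tau <- c(0.5, 0.3)                               # random-effect SDs (tau_00, tau_11)

dat <- do.call(rbind, lapply(seq_len(g), function(j) {
  zj <- rnorm(1)                                 # group-level predictor
  u  <- rnorm(2, 0, tau)                         # uncorrelated random effects
  b0 <- 1    + 0.5 * zj + u[1]                   # beta_0j = gamma_00 + gamma_01 * z_j + u_0j
  b1 <- 0.75 + 0   * zj + u[2]                   # beta_1j = gamma_10 + gamma_11 * z_j + u_1j
  x  <- rnorm(nj[j])
  y  <- rnegbin(nj[j], mu = exp(b0 + b1 * x), theta = 2)
  data.frame(id = j, x = x, z = zj, y = y)
}))
```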

In the next step, we introduced missing data in \(y\). Predictors \(x\) and \(z\) remained fully observed. We introduced missing data according to the following rule, where \(p_{ij}\) is the missingness probability and NA denotes a missing value (i.e. “not available”):

\[\begin{align*} p_{ij}&=\text{invlogit}(-1+z_{ij}) \\ v_{ij}&\sim \mathcal{U}(0,1) \\ y_{ij} &= \text{NA, } \text{if } v_{ij}<p_{ij}. \end{align*}\]

This created an average percentage of missing data in \(y\) of 30.2% (\(SD=3.0\%\)).
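In R, this mechanism could be generated along the following lines (a sketch, re-using the simulated data dat from above; plogis() is the inverse logit in base R):

```r
p_mis <- plogis(-1 + dat$z)                  # missingness probability
dat$y[runif(nrow(dat)) < p_mis] <- NA        # set y to missing
```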

The incomplete variable \(y\) was then imputed \(m=5\) times using both the Bayesian regression and the bootstrap regression version of the function. The imputation model included variables \(x\) and \(z\). The intercept and the slope of \(x\) were treated as random factors. We used the mice default settings and monitored convergence as outlined in van Buuren & Groothuis-Oudshoorn (2011). The data sets were then analyzed by the same NB2 model that was used for data imputation. Results were combined using Rubin’s formula (Rubin, 1987) and are displayed in Table 14 and Table 15. The tables display (a) the average parameter estimate across the 200 replications, (b) bias, defined as the difference between the true parameter and the average parameter estimate, (c) %Bias, i.e. bias divided by the true parameter and multiplied by 100%, and (d) the 95% coverage rate (the percentage of confidence intervals that include the true parameter).
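A sketch of the analysis-and-pooling step for a single replication could look as follows (assuming imp is a mids object created by mice() with method "2l.nb", analogous to the 2l.poisson sketch above; pooling the fixed effects with mice::pool() relies on a tidier for glmmTMB objects, e.g. from package broom.mixed):

```r
library(glmmTMB)
library(broom.mixed)   # tidy() methods for glmmTMB fits

fit <- with(imp, glmmTMB(y ~ x + z + (1 + x | id), family = nbinom2))
summary(pool(fit))     # combine the fixed effects via Rubin's rules
```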

Table 14: Performance of MI based on a Bayesian regression two-level NB2 model

| Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\sigma_{00}\) | \(\sigma_{11}\) | \(\sigma_{01}\) | \(\theta\) |
|---|---|---|---|---|---|---|---|
| \(Q\) | 1.000 | 0.750 | 0.500 | 0.500 | 0.300 | 0.000 | 2.000 |
| \(\widehat{Q}\) | 0.996 | 0.748 | 0.503 | 0.471 | 0.294 | -0.088 | 1.961 |
| Bias | 0.004 | 0.002 | -0.003 | 0.029 | 0.006 | 0.088 | 0.039 |
| %Bias | 0.400 | 0.267 | -0.600 | 5.800 | 2.000 | Inf | 1.950 |
| Coverage Rate | 93.500 | 95.500 | 94.500 | NA | NA | NA | NA |

Table 15: Performance of MI based on a bootstrap regression two-level NB2 model

| Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\sigma_{00}\) | \(\sigma_{11}\) | \(\sigma_{01}\) | \(\theta\) |
|---|---|---|---|---|---|---|---|
| \(Q\) | 1.000 | 0.750 | 0.500 | 0.500 | 0.300 | 0.000 | 2.000 |
| \(\widehat{Q}\) | 0.996 | 0.748 | 0.504 | 0.472 | 0.295 | -0.102 | 1.977 |
| Bias | 0.004 | 0.002 | -0.004 | 0.028 | 0.005 | 0.102 | 0.023 |
| %Bias | 0.400 | 0.267 | -0.800 | 5.600 | 1.667 | Inf | 1.150 |
| Coverage Rate | 94.000 | 95.500 | 93.500 | NA | NA | NA | NA |

As can be seen, the average parameter estimates are close to the defined population quantities, %Bias is below 10%, and coverage is above 90% for all parameters, which indicates that the imputation function works well when, as in this case, model assumptions are met.

4.10 Imputation of zero-inflated two-level count data

Methods "2l.hp" and "2l.hnb" impute zero-inflated two-level count data based on a two-level hurdle model. Analogously, methods "2l.zip" and "2l.zinb" create imputations based on a two-level zero-inflated Poisson or NB model respectively. The .noint variants of the functions treat the intercept only as a fixed, but not as a random effect. It may be specified, if the intercept is excluded from the random part of the zero model (.noint.zero), the random part of the count model (.noint.count), or from the random part of both models (.noint.both). For details about all model variants, see help("mice.impute.2l.hnb"

Different sets of predictors can be specified for the zero and count parts of the model. Allowed entries in the predictorMatrix are listed in Table 16; a brief specification sketch follows the table:

Table 16: Allowed entries in the predictorMatrix

| numeric code | meaning |
|---|---|
| 0 | variable is not included in the imputation model |
| 1 | variable is included as a fixed effect |
| 2 | variable is included as a fixed and random effect |
| 3 | variable is included as a fixed effect (count model only) |
| 4 | variable is included as a fixed and random effect (count model only) |
| 5 | variable is included as a fixed effect (zero model only) |
| 6 | variable is included as a fixed and random effect (zero model only) |
| -2 | class variable |
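As an illustration, the following sketch requests hurdle NB imputation for a hypothetical long-format data set dat with a cluster variable id, predictors x and w, and an incomplete count variable y, where x enters both model parts and w enters the count model only (variable names are illustrative):

```r
library(mice)
library(countimp)

ini  <- mice(dat, maxit = 0)
pred <- ini$predictorMatrix
pred["y", ]     <- 0
pred["y", "id"] <- -2        # class (cluster) variable
pred["y", "x"]  <- 2         # fixed and random effect in both model parts
pred["y", "w"]  <- 3         # fixed effect in the count model only
meth <- ini$method
meth["y"] <- "2l.hnb"        # two-level hurdle NB imputation

imp <- mice(dat, method = meth, predictorMatrix = pred, m = 5)
```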

4.10.1 Monte Carlo Simulation

We evaluated these new or updated imputation procedures for zero-inflated two-level data by means of a Monte Carlo simulation with 200 replications. The design was similar to the one used by Kleinke & Reinecke (2015b): We simulated data of \(g=50\) groups of size \(n_j=100\), \(j\in\{1,\dots,g\}\). The data of the 50 groups were generated separately and then stacked upon each other to create data sets in long format. Each data set consisted of one continuous individual-level predictor \(x\) and the overdispersed dependent count variable \(y\). \(x\) influenced both the zero and the count process. The intercept and the slope of \(x\) were treated as random in both the zero and the count part of the model. We set the following population parameters: \(\beta_0=1\) and \(\beta_1=.75\), the fixed intercept and slope terms of the count part of the model, and \(\gamma_0=0\) and \(\gamma_1=.5\), the fixed intercept and slope terms of the zero part of the model. Random effects standard deviations were \(.5\) and \(.3\) for the intercept term (\(\tau_{00}\)) and the slope of \(x\) (\(\tau_{11}\)) respectively. The random effects correlation (\(\tau_{01}\)) was set to \(0\). \(x_{ij}\) were simulated from a \(\mathcal{N}(0,1)\) normal population. Random effects \(u\) were also drawn from normal \(\mathcal{N}(0,\tau)\) populations. Data were generated as follows. We first obtained parameters for the zero part and the count part of the model. For each group \(j\), \(j\in\{1,\dots,g\}\), we simulated the parameters:

\[\begin{align*} \gamma_{0j}&=0+u_{0jz}\\ \gamma_{1j}&=.5+u_{1jz}\\ \beta_{0j}&=1+u_{0jc}\\ \beta_{1j}&=.75+u_{1jc} \end{align*}\]

The probability of belonging to the zero versus non-zero category was then determined by

\[\begin{equation*} \mu_{ij}=\text{invlogit}(\gamma_{0j}+\gamma_{1j}x_{ij}) \end{equation*}\]

We then drew uniform random numbers \(z_{ij}\sim\mathcal{U}(0,1)\); if \(z_{ij}<\mu_{ij}\), the dependent variable \(y_{ij}\) was set to 0, and to 1 otherwise. In the next step, each nonzero observation in \(y\) was replaced by a random draw from a zero-truncated Poisson distribution with mean \(\exp(\beta_{0j}+\beta_{1j}x_{ij})\) in the Poisson condition, and by a random draw from a zero-truncated NB distribution with mean \(\exp(\beta_{0j}+\beta_{1j} x_{ij})\) and overdispersion parameter \(\theta=2\) in the NB condition.
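For illustration, this step could be sketched as follows for one group, using only the fixed-effect values (group-specific random effects and the NB condition are omitted for brevity); zero-truncated Poisson numbers are drawn here via the inverse-CDF trick:

```r
# zero-truncated Poisson draws: sample the CDF uniformly above P(X = 0)
rztpois <- function(n, lambda) qpois(runif(n, dpois(0, lambda), 1), lambda)

g0 <- 0;  g1 <- 0.5      # zero part (fixed effects)
b0 <- 1;  b1 <- 0.75     # count part (fixed effects)
x  <- rnorm(100)
mu <- plogis(g0 + g1 * x)                    # probability of a zero count
y  <- ifelse(runif(100) < mu, 0L, 1L)
y[y == 1] <- rztpois(sum(y == 1), exp(b0 + b1 * x[y == 1]))
```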

Then, we introduced missing data in \(y\). Predictor \(x\) remained fully observed. We introduced missing data according to the following rule, where \(p_{ij}\) is the missingness probability and NA denotes a missing value (i.e. “not available”):

\[\begin{align*} p_{ij}&=\text{invlogit}(-1+x_{ij}) \\ v_{ij}&\sim \mathcal{U}(0,1) \\ y_{ij} &= \text{NA, } \text{if } v_{ij}<p_{ij}. \end{align*}\]

This created an average percentage of missing data in \(y\) of 30.4%.

The incomplete variable \(y\) was then imputed \(m=5\) times using both the Bayesian regression and the bootstrap regression version of the respective function (hurdle Poisson imputation in the Poisson condition, and hurdle NB imputation in the negative binomial condition). \(x\) was included as a predictor in both the zero and the count part of the model. The intercept and the slope of \(x\) were treated as random factors (again in both parts of the model). We used the mice default settings and monitored convergence (for details, see Chapter 5). The \(m\) completed data sets were then analyzed by the same model that was used for data imputation. Results were combined using Rubin’s formula (Rubin, 1987).
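As a sketch of the analysis step in the hurdle NB condition (assuming imp is a mids object created with method "2l.hnb", cf. the sketch after Table 16), one way to fit a two-level hurdle NB model is via glmmTMB's truncated count family combined with a zero model; pooling again relies on a tidier for glmmTMB objects (broom.mixed):

```r
library(glmmTMB)
library(broom.mixed)

fit <- with(imp, glmmTMB(y ~ x + (1 + x | id),
                         ziformula = ~ x + (1 + x | id),
                         family = truncated_nbinom2))
summary(pool(fit))   # combine fixed effects via Rubin's rules
```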

Results of hurdle Poisson imputation are displayed in Table 17 and Table 18; results of hurdle NB imputation are displayed in Table 19 and Table 20. The tables display the average parameter estimate across the 200 replications, the standard deviation of the estimates across the replications, bias, percent bias, and 95% coverage.

Table 17: Performance of MI based on a Bayesian regression two-level hurdle Poisson model

| Parameter | \(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate |
|---|---|---|---|---|---|
| \(\beta_0\) | 1.00 | 1.001 | -0.001 | -0.100 | 94.924 |
| \(\beta_1\) | 0.75 | 0.746 | 0.004 | 0.533 | 93.909 |
| \(\gamma_0\) | 0.00 | -0.006 | 0.006 | Inf | 92.893 |
| \(\gamma_1\) | 0.50 | 0.492 | 0.008 | 1.600 | 96.447 |
| \(\sigma_{00c}\) | 0.50 | 0.491 | 0.009 | 1.800 | NA |
| \(\sigma_{11c}\) | 0.30 | 0.283 | 0.017 | 5.667 | NA |
| \(\sigma_{01c}\) | 0.00 | -0.008 | 0.008 | Inf | NA |
| \(\sigma_{00z}\) | 0.50 | 0.477 | 0.023 | 4.600 | NA |
| \(\sigma_{11z}\) | 0.30 | 0.271 | 0.029 | 9.667 | NA |
| \(\sigma_{01z}\) | 0.00 | -0.053 | 0.053 | Inf | NA |

Table 18: Performance of MI based on a bootstrap regression two-level hurdle Poisson model

| Parameter | \(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate |
|---|---|---|---|---|---|
| \(\beta_0\) | 1.00 | 1.011 | -0.011 | -1.100 | 94.416 |
| \(\beta_1\) | 0.75 | 0.748 | 0.002 | 0.267 | 96.447 |
| \(\gamma_0\) | 0.00 | -0.006 | 0.006 | Inf | 91.878 |
| \(\gamma_1\) | 0.50 | 0.495 | 0.005 | 1.000 | 91.878 |
| \(\sigma_{00c}\) | 0.50 | 0.446 | 0.054 | 10.800 | NA |
| \(\sigma_{11c}\) | 0.30 | 0.261 | 0.039 | 13.000 | NA |
| \(\sigma_{01c}\) | 0.00 | -0.127 | 0.127 | Inf | NA |
| \(\sigma_{00z}\) | 0.50 | 0.439 | 0.061 | 12.200 | NA |
| \(\sigma_{11z}\) | 0.30 | 0.269 | 0.031 | 10.333 | NA |
| \(\sigma_{01z}\) | 0.00 | -0.108 | 0.108 | Inf | NA |

Table 19: Performance of MI based on a Bayesian regression two-level hurdle NB model

| Parameter | \(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate |
|---|---|---|---|---|---|
| \(\beta_0\) | 1.00 | 0.993 | 0.007 | 0.700 | 93.939 |
| \(\beta_1\) | 0.75 | 0.735 | 0.015 | 2.000 | 92.424 |
| \(\gamma_0\) | 0.00 | -0.003 | 0.003 | Inf | 93.939 |
| \(\gamma_1\) | 0.50 | 0.497 | 0.003 | 0.600 | 96.970 |
| \(\sigma_{00c}\) | 0.50 | 0.480 | 0.020 | 4.000 | NA |
| \(\sigma_{11c}\) | 0.30 | 0.276 | 0.024 | 8.000 | NA |
| \(\sigma_{01c}\) | 0.00 | -0.059 | 0.059 | Inf | NA |
| \(\sigma_{00z}\) | 0.50 | 0.485 | 0.015 | 3.000 | NA |
| \(\sigma_{11z}\) | 0.30 | 0.274 | 0.026 | 8.667 | NA |
| \(\sigma_{01z}\) | 0.00 | -0.054 | 0.054 | Inf | NA |
| \(\theta\) | 2.00 | 2.098 | -0.098 | -4.900 | NA |

Table 20: Performance of MI based on a bootstrap regression two-level hurdle NB model

| Parameter | \(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate |
|---|---|---|---|---|---|
| \(\beta_0\) | 1.00 | 0.996 | 0.004 | 0.400 | 92.929 |
| \(\beta_1\) | 0.75 | 0.743 | 0.007 | 0.933 | 93.939 |
| \(\gamma_0\) | 0.00 | -0.001 | 0.001 | Inf | 93.434 |
| \(\gamma_1\) | 0.50 | 0.500 | 0.000 | 0.000 | 95.455 |
| \(\sigma_{00c}\) | 0.50 | 0.440 | 0.060 | 12.000 | NA |
| \(\sigma_{11c}\) | 0.30 | 0.262 | 0.038 | 12.667 | NA |
| \(\sigma_{01c}\) | 0.00 | -0.151 | 0.151 | Inf | NA |
| \(\sigma_{00z}\) | 0.50 | 0.444 | 0.056 | 11.200 | NA |
| \(\sigma_{11z}\) | 0.30 | 0.268 | 0.032 | 10.667 | NA |
| \(\sigma_{01z}\) | 0.00 | -0.127 | 0.127 | Inf | NA |
| \(\theta\) | 2.00 | 1.991 | 0.009 | 0.450 | NA |

Estimates are sufficiently accurate. It appears that the bootstrap variants performed slightly worse than the Bayesian variants, especially regarding the random part of the model. This might be caused by the block bootstrap approach implemented here (we re-sample clusters rather than individual cases), which requires a substantial number of clusters. Future research needs to establish under which scenarios it is better to use the Bayesian variants of the functions and when to use the bootstrap variants; little is yet known about level-1 and level-2 sample size requirements.

Contact

PD Dr. Kristian Kleinke
University of Siegen
Department of Psychology
Obergraben 23, D-57072 Siegen
kristian.kleinke@uni-siegen.de

References

Barnard, J., & Rubin, D. B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4), 948–955.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive missing-data strategies in modern missing-data procedures. Psychological Methods, 6(4), 330–351. https://doi.org/10.1037/1082-989X.6.4.330
Gaffert, P., Meinfelder, F., & Bosch, V. (2016). midastouch: Towards an MI-proper predictive mean matching [Discussion paper]. https://www.uni-bamberg.de/fileadmin/uni/fakultaeten/sowi_lehrstuehle/statistik/Personen/Dateien_Florian/properPMM.pdf
Hilbe, J. M. (2011). Negative binomial regression (\(2^{\text{nd}}\) ed.). Cambridge University Press.
Kleinke, K. (2017). Multiple imputation under violated distributional assumptions – a systematic evaluation of the assumed robustness of predictive mean matching. Journal of Educational and Behavioral Statistics, 42(4), 371–404. https://doi.org/10.3102/1076998616687084
Kleinke, K., & Reinecke, J. (2013). Multiple imputation of incomplete zero-inflated count data. Statistica Neerlandica, 67(3), 311–336. https://doi.org/10.1111/stan.12009
Kleinke, K., & Reinecke, J. (2015a). Multiple imputation of multilevel count data. In U. Engel, B. Jann, P. Lynn, A. Scherpenzeel, & P. Sturgis (Eds.), Improving Survey Methods: Lessons from Recent Research (pp. 381–396). Routledge, Taylor & Francis.
Kleinke, K., & Reinecke, J. (2015b). Multiple imputation of overdispersed multilevel count data. In U. Engel (Ed.), Survey Measurements. Techniques, Data Quality and Sources of Error (pp. 209–226). Campus/The University of Chicago Press.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1–14.
Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of Econometrics, 33(3), 341–365. https://doi.org/10.1016/0304-4076(86)90002-3
Raghunathan, T. E., Lepkowski, J. M., Van Hoewyk, J., & Solenberger, P. (2001). A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology, 27(1), 85–96.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592. https://doi.org/10.1093/biomet/63.3.581
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley.
Schafer, J. L. (1997a). Analysis of incomplete multivariate data. Chapman & Hall.
Schafer, J. L. (1997b). Imputation of missing covariates under a general linear mixed model [Technical Report 97-10]. Pennsylvania State University, The Methodology Center.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147
van Buuren, S. (2012). Flexible imputation of missing data. Chapman & Hall / CRC.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). MICE: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1–67.
von Hippel, P. T. (2013). Should a normal imputation model be modified to impute skewed variables? Sociological Methods & Research, 42(1), 105–138.
Yu, L. M., Burton, A., & Rivero-Arias, O. (2007). Evaluation of software for multiple imputation of semi-continuous data. Statistical Methods in Medical Research, 16(3), 243–258. https://doi.org/10.1177/0962280206074464
Zeileis, A., Kleiber, C., & Jackman, S. (2008). Regression models for count data in R. Journal of Statistical Software, 27(8), 1–25.