Chapter 4 countimp
4.1 A cautionary note
Package countimp creates imputations based on parametric count models. The plausibility of the obtained imputations depends on the correct specification of the imputation procedure, i.e. the count data regression model that is used to predict missing information must fit the empirical data well. Research by Kleinke & Reinecke (2013) suggests that misspecification of the (count) imputation model can bias statistical inferences quite noticeably. We therefore advise researchers to spend sufficient time on modelling. To determine which count model fits the data best, we refer readers to the excellent book by Hilbe (2011). We would also like to draw attention to the R packages Qtools and ImputeRobust, which create imputations based on nonparametric quantile regression and semi-parametric generalized additive models, respectively; these may be suitable alternatives in situations where the assumptions of parametric count models are violated.
4.2 Count data models
An overview of the parametric imputation models available in package countimp is given in Table 4.1. More details on the theoretical background of count regression models are given in Zeileis et al. (2008) and in Hilbe (2011).
Function Name | model / family |
---|---|
mice.impute.poisson() | Poisson |
mice.impute.quasipoisson() | quasi-Poisson |
mice.impute.nb() | negative binomial |
mice.impute.zip() | zero-inflated Poisson |
mice.impute.zinb() | zero-inflated negative binomial |
mice.impute.hp() | hurdle Poisson |
mice.impute.hnb() | hurdle negative binomial |
mice.impute.2l.poisson() | two-level Poisson |
mice.impute.2l.nb() | two-level negative binomial |
mice.impute.2l.zip() | two-level zero-inflated Poisson |
mice.impute.2l.zinb() | two-level zero-inflated negative binomial |
mice.impute.2l.hp() | two-level hurdle Poisson |
mice.impute.2l.hnb() | two-level hurdle negative binomial |
4.3 Installation
The latest version can be installed from GitHub using function install_git from package devtools:
devtools::install_git(url = "https://github.com/kkleinke/countimp",
branch = "master")
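This requires package devtools, which can itself be installed from CRAN:

## install devtools from CRAN if necessary
install.packages("devtools")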
4.4 Imputation of incomplete Poisson distributed variables
The function to impute an incomplete Poisson distributed variable is called mice.impute.poisson(). Appending .boot to the function name calls the bootstrap regression variant (see help("mice.impute.poisson.boot")). The functions have the following arguments:
library(countimp)
args("mice.impute.poisson")
## function (y, ry, x, wy = NULL, EV = TRUE, ...)
## NULL
Arguments y, ry, and x need not be specified; they are automatically obtained from mice. For information on how to use mice’s new wy argument, and for further information, see help("mice").
The numeric vector y is the incomplete Poisson distributed variable that shall be imputed. ry is the response indicator of y, with TRUE indicating an observed value and FALSE a missing observation in y. wy is a logical vector of length length(y); TRUE entries indicate the locations in y for which imputations shall be generated (the default is !ry). x is a matrix with length(y) rows that contains the complete covariates, i.e. the variables in the imputation model, which are specified via mice’s predictorMatrix argument. The variables in x are used to predict the missing information in y. Argument EV defines how the function handles extreme imputations: when set to TRUE, the function automatically detects extreme values among the imputations and replaces them by values obtained by predictive mean matching.
We use the standard glm.fit function from package stats to fit the Poisson model to the observed part of the data, i.e. x[ry,] and y[ry]. Imputations are drawn from a Poisson distribution with mean \(\lambda\), the value predicted for the respective incomplete case.
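The following minimal sketch illustrates this two-step logic (an illustration only, not the package source; the function name and the normal approximation to the posterior of the coefficients are our own assumptions):

## illustrative sketch of Bayesian Poisson imputation (not the package source);
## assumes x is a named numeric matrix without factor columns
impute.poisson.sketch <- function( y, ry, x ) {
  dat <- data.frame( y = y[ry], x[ry, , drop = FALSE] )
  fit <- glm( y ~ ., data = dat, family = poisson(link = "log") )
  ## draw regression coefficients from their approximate posterior
  beta <- MASS::mvrnorm( 1, mu = coef(fit), Sigma = vcov(fit) )
  ## predicted means for the cases to be imputed
  lam <- exp( cbind( 1, as.matrix( x[!ry, , drop = FALSE] ) ) %*% beta )
  rpois( sum(!ry), lam )
}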
4.4.1 Example
We generate an example data set with a Poisson distributed dependent variable \(y\) and three continuous predictors \(x1\)–\(x3\) and introduce MAR missingness in the dependent variable.
set.seed( 1234 )
b0 <- 1
b1 <- .75
b2 <- -.25
b3 <- .5
N <- 5000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
lam <- exp( b0 + b1 * x1 + b2 * x2 + b3 * x3 )
y <- rpois( N, lam )
POIS <- data.frame( y, x1, x2, x3 )
## introduce MAR missingness to simulated data
generate.md <- function( data, pos = 1, Z = 2, pmis = .5, strength = c( .5, .5 ) )
{
  ## pos: column that receives missing data; Z: column missingness depends on
  ## pmis: overall proportion of missing data; strength: shares of the missing
  ## values drawn from cases below vs. above the mean of Z
  total <- round( pmis * nrow(data) )          ## number of values to delete
  sm <- which( data[,Z] < mean( data[,Z] ) )   ## cases below the mean of Z
  gr <- which( data[,Z] > mean( data[,Z] ) )   ## cases above the mean of Z
  sel.sm <- sample( sm, round( strength[1] * total ) )
  sel.gr <- sample( gr, round( strength[2] * total ) )
  sel <- c( sel.sm, sel.gr )
  data[sel,pos] <- NA
  return(data)
}
MPOIS <- generate.md( POIS, pmis = .2, strength = c( .2, .8) )
This procedure generates 20% missing data in y. Missingness depends on variable x1: cases whose x1-value is above the mean of x1 are more likely to have a missing value in y. The resulting incomplete data set is called MPOIS.
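A quick check confirms the missing data pattern (exact numbers depend on the random seed):

## proportion of missing values in y, and missingness rates for cases
## below vs. above the mean of x1
mean( is.na( MPOIS$y ) )
tapply( is.na( MPOIS$y ), MPOIS$x1 > mean( MPOIS$x1 ), mean )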
Imputations are then generated using the wrapper function countimp(), which passes its arguments on to mice. To select function mice.impute.poisson() to impute the missing data in \(y\), the respective slot in method has to be set to "poisson". See help("mice") for details.
colnames(MPOIS)
## [1] "y" "x1" "x2" "x3"
MPOIS has four variables. Thus, method has to be a vector of length 4. y is the first variable in MPOIS, so we set the first entry in method to "poisson" (or "poisson.boot" for the bootstrap variant). The other variables are completely observed and get method "".
imp <- countimp( MPOIS, method = c( "poisson" ,"" ,"" ,"" ), print=FALSE)
impb <- countimp( MPOIS, method = c( "poisson.boot" ,"" ,"" ,"" ), print=FALSE)
Note that if predictorMatrix is not specified, all other variables are used to predict the missing information in y.
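For example, the following sketch makes this default explicit (equivalent to the call above):

## explicit predictor matrix: all other variables predict y
pred <- matrix( 1, 4, 4, dimnames = list( names(MPOIS), names(MPOIS) ) )
diag( pred ) <- 0
imp.explicit <- countimp( MPOIS, method = c( "poisson", "", "", "" ),
                          predictorMatrix = pred, print = FALSE )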
The resulting object imp is of class mids (a multiply imputed data set). Standard mice functions like with() and pool() can be applied to this object to automate the repeated data analysis and the pooling of results.
repeated.analysis <- with(imp, glm(y~x1+x2+x3,family = poisson))
summary(pool(repeated.analysis))
## est se t df Pr(>|t|)
## (Intercept) 0.9779894 0.009909391 98.69320 700.87521 0
## x1 0.7507701 0.007478281 100.39341 1030.71997 0
## x2 -0.2342093 0.008082908 -28.97588 91.46838 0
## x3 0.4846562 0.008227717 58.90531 55.51060 0
## lo 95 hi 95 nmis fmi lambda
## (Intercept) 0.9585338 0.9974451 NA 0.07225868 0.06961507
## x1 0.7360957 0.7654445 0 0.05690214 0.05507394
## x2 -0.2502639 -0.2181547 0 0.22348656 0.20669134
## x3 0.4681709 0.5011415 0 0.29147133 0.26639537
For details about the respective slots of the output, see van Buuren & Groothuis-Oudshoorn (2011).
4.5 Imputation of overdispersed count data
countimp has two methods to impute overdispersed count data:

- imputation based on a quasi-Poisson model, and
- imputation based on a negative binomial model.

Both methods can be used when the equidispersion assumption of the standard Poisson model is violated, i.e. when the variance of the count variable is larger than its mean.
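Comparing mean and variance of the observed counts gives a first indication (a sketch; y.obs denotes the observed part of the hypothetical incomplete count variable):

## quick equidispersion check: variance clearly above the mean
## suggests overdispersion
c( mean = mean( y.obs ), variance = var( y.obs ) )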
4.5.1 Imputation based on a quasi-Poisson model
Method "quasipoisson"
imputes univariate missing data based on a Bayesian quasi-Poisson GLM using the standard R function glm.fit
(i.e. using family = quasipoisson(link = log)
). Imputations are simulated from a negative binomial distribution to ensure an adequate dispersion of imputed values: The quasi-Poisson model estimates Pearson dispersion \(\delta\) that gives the amount of overdispersion in the data. Pearson dispersion \(\delta\) is stored in slot $dispersion
of the output of the summary.glm
function. We supply this parameter to the rnbinom()
function from package stats
to obtain the imputations. For details about the parametrization of the NB distribution used by rnbinom
, see help("rnbinom")
.
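To illustrate the idea (a sketch using the incomplete data set MNB generated in the example below): rnbinom() with parameters mu and size has variance \(\mu+\mu^2/\text{size}\), so matching the quasi-Poisson variance \(\delta\mu\) suggests size \(=\mu/(\delta-1)\), assuming \(\delta>1\). This mapping is one plausible choice, not necessarily the one implemented in countimp.

## sketch: quasi-Poisson fit, then NB draws with matched dispersion
fit   <- glm( y ~ x1 + x2 + x3, data = MNB, family = quasipoisson(link = log) )
delta <- summary( fit )$dispersion                                 ## Pearson dispersion
mu    <- predict( fit, newdata = MNB[ is.na(MNB$y), ], type = "response" )
imps  <- rnbinom( length(mu), size = mu / (delta - 1), mu = mu )   ## assumes delta > 1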
4.5.2 Example
We simulate an overdispersed count variable y and generate about 20% missing data in this variable. Imputation, repeated data analysis, and the pooling of results work analogously to Section 4.4.1.
## simulate overdispersed count data
set.seed( 1234 )
b0 <- 1
b1 <- .75
b2 <- -.25
b3 <- .5
N <- 5000
x1 <- rnorm(N)
x2 <- rnorm(N)
x3 <- rnorm(N)
mu <- exp( b0 + b1 * x1 + b2 * x2 + b3 * x3 )
y <- MASS::rnegbin( N, theta = 2, mu )
NB <- data.frame( y, x1, x2, x3 )
## introduce MAR missingness to simulated data
total <- round( .2 * N ) ##number of missing data in y
sm <- which( NB[,2] < mean( NB[,2] ) ) ##subset: cases with x1<mean(x1)
gr <- which( NB[,2] > mean( NB[,2] ) ) ##subset: cases with x1>mean(x1)
sel.sm <- sample( sm, round( .2 * total ) ) ##select cases to set as missing
sel.gr <- sample( gr, round( .8 * total ) ) ##select cases to set as missing
sel <- c( sel.sm,sel.gr )
MNB <- NB
MNB[sel,1] <- NA ##delete selected data
## impute missing data
imp <- countimp( MNB, method = c( "quasipoisson", "", "", "" ), print = FALSE)
## repeated data analysis and pooling of results
repeated.analysis <- with(imp, glm(y~x1+x2+x3,family=quasipoisson))
summary(pool(repeated.analysis))
## est se t df Pr(>|t|) lo 95
## (Intercept) 1.0075246 0.01613825 62.43084 4358.0686 0 0.9758854
## x1 0.7450712 0.01338057 55.68306 163.5364 0 0.7186503
## x2 -0.2469628 0.01274407 -19.37865 493.9668 0 -0.2720021
## x3 0.4555993 0.01448504 31.45310 38.6698 0 0.4262926
## hi 95 nmis fmi lambda
## (Intercept) 1.0391638 NA 0.01086766 0.01041384
## x1 0.7714922 0 0.16350855 0.15334069
## x2 -0.2219235 0 0.08866686 0.08498446
## x3 0.4849061 0 0.35243277 0.31978490
4.6 Imputation based on a negative binomial model
Method "nb"
fits a negative binomial model using function glm.nb()
from package MASS
and simulates the imputations from a negative binomial distribution, using function rnegbin()
from package MASS
.
The imputation process works analogous to Section 4.4.1.
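For the overdispersed data set MNB from Section 4.5.2, this amounts to:

## impute based on the negative binomial model
imp.nb <- countimp( MNB, method = c( "nb", "", "", "" ), print = FALSE )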
4.7 Imputation of count data with an excess number of zero counts
Methods zip and zinb impute missing data based on a zero-inflated Poisson or negative binomial model respectively (for details, see Kleinke & Reinecke, 2013). Methods hp and hnb use a hurdle Poisson or negative binomial model. Users can specify different sets of predictors for the zero and count parts of the models. The allowed entries in the predictorMatrix are listed in Table 4.2.
numeric code | meaning |
---|---|
0 | variable is not included in either model |
1 | variable is included in both the zero and the count model |
2 | count model predictor |
3 | zero model predictor |
Note that the current mice version does not allow entries in the predictorMatrix other than 0 and 1, unless the model is a multilevel model. Two-level imputation models in mice carry the prefix 2l. Kleinke & Reinecke (2013) therefore used this as a workaround and labeled the functions 2l.zip and 2l.zinb; these were included in countimp version 1 to allow the specification of different sets of predictors for the zero and count parts of the model via the predictorMatrix. This, however, led to some confusion among empirical researchers, since these functions did not fit two-level models.
Caution: In version 2 of package countimp, methods 2l.zip and 2l.zinb in fact fit two-level models – see Section 4.10.
In version 2 of package countimp, we have found a better solution: the wrapper function countimp() takes care of the problem. It is therefore important that users call countimp() rather than mice() to create the imputations when the imputation model is either a zero-inflation model or a hurdle model. countimp() takes the same arguments as mice() and passes them on to mice; the imputations are then created by mice.
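Suppose, for illustration, an incomplete zero-inflated count y in a hypothetical data set ZIPDATA with predictors x1–x3, where x1 and x2 should predict the count part and x3 the zero part:

## hypothetical example: different predictor sets for the two model parts
vars <- c( "y", "x1", "x2", "x3" )
pred <- matrix( 0, 4, 4, dimnames = list( vars, vars ) )
pred[ "y", c( "x1", "x2" ) ] <- 2   ## count model predictors
pred[ "y", "x3" ]            <- 3   ## zero model predictor
imp <- countimp( ZIPDATA, method = c( "zip", "", "", "" ),
                 predictorMatrix = pred, print = FALSE )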
4.7.1 Monte Carlo Simulation
Kleinke & Reinecke (2013) have tested the zero-inflation functions (methods "2l.zip"
and "2l.zinb"
) in various scenarios and found that the methods performed adequately well, when the empricial data were in fact zero-inflated Poisson or negative binomial. Here, we use a highly similar set-up to evaluate the new hurdle model variants of these functions, which are availabe in version 2 of the countimp
package.
Note that unlike zero-inflation models, hurdle models allow only one source of zeros. The zero model determines whether case \(i\) has a zero or a non-zero observation, and the count model determines which non-zero observation case \(i\) has. Hurdle models assume that the residuals of the hurdle component (here the zero model) and the outcome model (here the count model) are uncorrelated. This assumption is plausible if quite distinct groups form the zero versus the non-zero category.
An example where a hurdle model might be an appropriate analysis model is the count of cigarettes smoked per day: non-smokers will quite certainly have a zero count, while smokers will vary in their number of cigarettes smoked per day. Furthermore, there might be one set of variables that predicts whether persons are smokers or non-smokers, while other variables predict how many cigarettes a smoker consumes per day.
To test the new hurdle model imputation functions, we created 1000 data sets with sample sizes \(N\in\{500,1000,10000\}\) that contained four variables: the dependent count variable \(y\), which has an excess number of zeros, and the continuous predictors \(x_1\), \(x_2\), and \(x_3\). \(x_1\)–\(x_3\) were drawn from normal \(\mathcal{N}(0,1)\) populations. \(\mu_{0i}\), the individual probability of belonging to the count versus the zero group, was determined by the logit model
\[\begin{equation} \mu_{0i}=\text{invlogit}(2\cdot x_{3i}) \end{equation}\]

The respective observation in \(y\) was drawn from a binomial distribution with size 1 and probability \(\mu_{0i}\). Finally, a “1” in \(y\) was replaced by a draw from a zero-truncated Poisson distribution with mean \(\exp(1+.3\cdot x_{1i}+.3\cdot x_{2i})\) in the Poisson conditions, or from a zero-truncated negative binomial distribution with the same mean and overdispersion parameter \(\theta=1\) in the negative binomial conditions. MAR missingness was then introduced as outlined in Kleinke & Reinecke (2013). Missing data were imputed \(m=5\) times using both the Bayesian regression and the bootstrap regression variants of our hurdle model imputation functions. Function hurdle from package pscl was used to fit the corresponding hurdle model to the completed data sets.

Results of the Poisson conditions are displayed in Tables 4.3 and 4.4. \(\beta\) are the coefficients in the count part of the model, \(\gamma\) the coefficients in the zero model. \(Q\) are the simulated true population quantities, \(\widehat{Q}\) the Monte Carlo estimates, and \(SD_{\widehat{Q}}\) is the standard deviation of the estimates across the 1000 replications. We also report %Bias and 95% coverage rates. As can be seen in Tables 4.3 and 4.4, the estimates are always very close to the true population quantities, bias is very small, and coverage is adequate for all parameters regardless of sample size.
N | Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\gamma_0\) | \(\gamma_1\) |
---|---|---|---|---|---|---|
500 | \(\widehat{Q}\) | 0.996 | 0.295 | 0.294 | -0.005 | 2.028 |
500 | \(SD_{\widehat{Q}}\) | 0.052 | 0.049 | 0.050 | 0.145 | 0.228 |
500 | Bias | 0.004 | 0.005 | 0.006 | 0.005 | -0.028 |
500 | Coverage Rate | 96.700 | 95.300 | 95.800 | 95.300 | 94.900 |
1000 | \(\widehat{Q}\) | 0.997 | 0.297 | 0.296 | -0.003 | 2.014 |
1000 | \(SD_{\widehat{Q}}\) | 0.039 | 0.034 | 0.036 | 0.101 | 0.163 |
1000 | Bias | 0.003 | 0.003 | 0.004 | 0.003 | -0.014 |
1000 | Coverage Rate | 94.600 | 95.200 | 95.200 | 95.800 | 95.400 |
10000 | \(\widehat{Q}\) | 1.000 | 0.299 | 0.300 | -0.001 | 2.004 |
10000 | \(SD_{\widehat{Q}}\) | 0.012 | 0.011 | 0.011 | 0.032 | 0.048 |
10000 | Bias | 0.000 | 0.001 | 0.000 | 0.001 | -0.004 |
10000 | Coverage Rate | 94.400 | 94.000 | 96.100 | 95.200 | 94.900 |
N | Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\gamma_0\) | \(\gamma_1\) |
---|---|---|---|---|---|---|
500 | \(\widehat{Q}\) | 0.994 | 0.293 | 0.293 | -0.007 | 2.037 |
500 | \(SD_{\widehat{Q}}\) | 0.052 | 0.049 | 0.049 | 0.147 | 0.234 |
500 | Bias | 0.006 | 0.007 | 0.007 | 0.007 | -0.037 |
500 | Coverage Rate | 95.400 | 95.300 | 96.200 | 95.100 | 95.000 |
1000 | \(\widehat{Q}\) | 0.996 | 0.295 | 0.295 | -0.005 | 2.017 |
1000 | \(SD_{\widehat{Q}}\) | 0.039 | 0.034 | 0.035 | 0.101 | 0.164 |
1000 | Bias | 0.004 | 0.005 | 0.005 | 0.005 | -0.017 |
1000 | Coverage Rate | 94.600 | 95.200 | 95.100 | 95.500 | 93.800 |
10000 | \(\widehat{Q}\) | 1.000 | 0.299 | 0.300 | -0.001 | 2.004 |
10000 | \(SD_{\widehat{Q}}\) | 0.012 | 0.011 | 0.011 | 0.032 | 0.049 |
10000 | Bias | 0.000 | 0.001 | 0.000 | 0.001 | -0.004 |
10000 | Coverage Rate | 93.900 | 94.100 | 94.400 | 94.200 | 95.300 |
N | Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\gamma_0\) | \(\gamma_1\) |
---|---|---|---|---|---|---|
500 | \(\widehat{Q}\) | 0.990 | 0.284 | 0.290 | 0.002 | 2.025 |
500 | \(SD_{\widehat{Q}}\) | 0.132 | 0.095 | 0.094 | 0.142 | 0.230 |
500 | Bias | 0.010 | 0.016 | 0.010 | -0.002 | -0.025 |
500 | Coverage Rate | 94.900 | 93.500 | 94.400 | 95.200 | 95.100 |
1000 | \(\widehat{Q}\) | 0.992 | 0.289 | 0.290 | -0.004 | 2.019 |
1000 | \(SD_{\widehat{Q}}\) | 0.098 | 0.065 | 0.064 | 0.100 | 0.162 |
1000 | Bias | 0.008 | 0.011 | 0.010 | 0.004 | -0.019 |
1000 | Coverage Rate | 94.400 | 94.400 | 93.800 | 95.900 | 96.400 |
10000 | \(\widehat{Q}\) | 1.000 | 0.298 | 0.299 | 0.001 | 2.003 |
10000 | \(SD_{\widehat{Q}}\) | 0.030 | 0.021 | 0.021 | 0.032 | 0.050 |
10000 | Bias | 0.000 | 0.002 | 0.001 | -0.001 | -0.003 |
10000 | Coverage Rate | 94.200 | 95.000 | 95.500 | 93.900 | 94.700 |
N | Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\gamma_0\) | \(\gamma_1\) |
---|---|---|---|---|---|---|
500 | \(\widehat{Q}\) | 0.992 | 0.280 | 0.288 | 0.002 | 2.039 |
500 | \(SD_{\widehat{Q}}\) | 0.134 | 0.094 | 0.093 | 0.145 | 0.230 |
500 | Bias | 0.008 | 0.020 | 0.012 | -0.002 | -0.039 |
500 | Coverage Rate | 94.600 | 92.900 | 92.800 | 94.800 | 94.600 |
1000 | \(\widehat{Q}\) | 0.991 | 0.285 | 0.289 | -0.003 | 2.024 |
1000 | \(SD_{\widehat{Q}}\) | 0.096 | 0.065 | 0.066 | 0.101 | 0.164 |
1000 | Bias | 0.009 | 0.015 | 0.011 | 0.003 | -0.024 |
1000 | Coverage Rate | 95.400 | 94.300 | 94.200 | 95.800 | 94.400 |
10000 | \(\widehat{Q}\) | 1.000 | 0.297 | 0.298 | 0.001 | 2.004 |
10000 | \(SD_{\widehat{Q}}\) | 0.030 | 0.021 | 0.020 | 0.032 | 0.050 |
10000 | Bias | 0.000 | 0.003 | 0.002 | -0.001 | -0.004 |
10000 | Coverage Rate | 93.600 | 93.800 | 95.300 | 95.300 | 94.600 |
Results of the NB conditions are displayed in Tables 4.5 and 4.6. Again, parameter estimates are typically accurate and coverage is adequate for all parameters. In the small sample size condition with \(N=500\), the overdispersion parameter \(\theta\) was slightly overestimated (1.149 for the Bayesian regression method and 1.187 for the bootstrap regression method).
With \(N=1000\), overdispersion parameter \(\theta\) was 1.082 for the Bayesian regression method, and 1.102 for the bootstrap regression method. When \(N=10000\), overdispersion parameter \(\theta\) was 1.010 for the Bayesian regression method, and 1.013 for the bootstrap regression method. In comparison, the mean estimate for \(\theta\) of the complete data sets – i.e. before any data in \(y\) were deleted – was 1.056.
All in all, results suggest that the imputation methods work sufficiently well.
4.8 Imputation of two-level Poisson distributed data
mice.impute.2l.poisson() imputes missing data based on a generalized linear mixed-effects Poisson model. Note that the function requires the data to be in long format, i.e. the repeated measurements are stacked upon one another. Additional time and id variables indicate the measurement time point (when the data are panel data) and cluster membership. The function is described in detail in Kleinke & Reinecke (2015a). To specify the imputation model, we use the same coding scheme as other two-level imputation functions in mice. The allowed entries in the predictorMatrix are given in Table 4.7:
numeric code | meaning |
---|---|
0 | variable is not included in the imputation model |
1 | variable is included as a fixed effect |
2 | variable is included as a fixed and random effect |
-2 | class variable |
The intercept is always treated as a fixed and a random factor. If this is not desired, the .noint variant of the function can be used, which treats the intercept only as a fixed, but not as a random effect (see help("mice.impute.2l.poisson")).
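Assuming long-format panel data PANEL with variables y, id, time, and x (names are illustrative), the imputation model could be specified as follows:

## hypothetical two-level specification: id is the class variable (-2),
## time enters as a fixed and random effect (2), x as a fixed effect (1)
vars <- c( "y", "id", "time", "x" )
pred <- matrix( 0, 4, 4, dimnames = list( vars, vars ) )
pred[ "y", "id" ]   <- -2
pred[ "y", "time" ] <-  2
pred[ "y", "x" ]    <-  1
imp <- countimp( PANEL, method = c( "2l.poisson", "", "", "" ),
                 predictorMatrix = pred, print = FALSE )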
4.9 Imputation of overdispersed two-level count data
Method "2l.nb" imputes incomplete data based on a two-level NB model (for details, see Kleinke & Reinecke, 2015b). In version 2 of package countimp, we have replaced package glmmADMB by glmmTMB to estimate the two-level NB model; in our practical experience, estimation by glmmTMB is faster and more stable. Note that the function requires data in long format and uses the same coding scheme as other two-level imputation functions in mice (cf. van Buuren & Groothuis-Oudshoorn, 2011). The specification of the imputation method and model works analogously to other multilevel imputation functions in mice; see for example help("mice.impute.2l.pan"). The allowed entries in the predictorMatrix are listed in Table 4.8.
numeric code | meaning |
---|---|
0 | variable is not included in the imputation model |
1 | variable is included as a fixed effect |
2 | variable is included as a fixed and random effect |
-2 | class variable |
4.9.1 Monte Carlo Simulation
We tested the refurbished version of mice.impute.2l.nb by means of a Monte Carlo simulation (with 200 replications). The design was very similar to the one used by Kleinke & Reinecke (2015b).
We simulated data of \(g=50\) groups of varying group sizes \(n_j\), \(j\in\{1,\dots,g\}\). The \(n_j\) were drawn randomly from the integers between 100 and 400. The data of the 50 groups were generated separately and then stacked upon each other to create data sets in long format. Each data set consisted of one continuous individual level predictor \(x\), one continuous group level predictor \(z\), and the overdispersed dependent count variable \(y\). The intercept and the slope of \(x\) were treated as random factors (i.e. they were allowed to vary across the 50 groups). We set the following population parameters: \(\beta_0=1\), the fixed intercept term, \(\beta_1=.75\), the coefficient of the individual level predictor \(x\), \(\beta_2=.5\), the coefficient of the group level predictor \(z\), and \(\theta=2\), the overdispersion parameter. Random effects standard deviations were \(.5\) and \(.3\) for the intercept term (\(\tau_{00}\)) and the slope of \(x\) (\(\tau_{11}\)) respectively. The random effects correlation (\(\tau_{01}\)) was set to \(0\). \(x_j\), \(z_j\), and the individual level residuals \(e_j\) were simulated from normal \(\mathcal{N}(0,1)\) populations. Random effects were also drawn from normal \(\mathcal{N}(0,\tau)\) populations. Data were generated as follows.
We first obtained parameters

\[\begin{equation*} \beta_{j}=\Gamma Z_{j}+u_{j}, \end{equation*}\]

where

\[\begin{equation*} \beta_{j}=\begin{bmatrix} \beta_{0j} \\ \beta_{1j} \end{bmatrix},\quad \Gamma = \begin{bmatrix} \gamma_{00} & \gamma_{01} \\ \gamma_{10} & \gamma_{11} \end{bmatrix} = \begin{bmatrix} 1 & .5 \\ 0.75 & 0 \end{bmatrix},\quad Z_j=\begin{bmatrix} 1 \\ z_{j} \end{bmatrix},\quad \text{and } u_{j}=\begin{bmatrix} u_{0j} \\ u_{1j} \end{bmatrix}. \end{equation*}\]

For each individual \(i\), the dependent count variable \(y_{ij}\) was then drawn from \(y_{ij}\sim\text{NB}(\mu_{ij}=\exp(X_{ij}\beta_j),\theta=2)\), using the rnegbin() function from R package MASS, where \(X_{ij}=\begin{bmatrix}1&x_{ij}\end{bmatrix}\).
In the next step, we introduced missing data in \(y\). Predictors \(x\), and \(z\) remained fully observed. We introduced missing data according to the following rule, where \(p_{ij}\) is the missingness probability and NA denotes a missing value (i.e. “not available”):
\[\begin{align*} p_{ij}&=\text{invlogit}(-1+z_{ij}) \\ v_{ij}&\sim \mathcal{U}(0,1) \\ y_{ij} &= \text{NA, } \text{if } v_{ij}<p_{ij}. \end{align*}\]

This created an average percentage of missing data in \(y\) of 30.2% (\(SD=3.0\%\)).
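In R notation, this rule amounts to the following sketch (with the group-level predictor \(z\) expanded to individual level):

## sketch of the MAR rule: delete y where a uniform draw falls below
## the individual missingness probability
p <- plogis( -1 + z )                ## invlogit
y[ runif( length(y) ) < p ] <- NA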
The incomplete variable \(y\) was then imputed \(m=5\) times using both the Bayesian regression and bootstrap regression versions of the function. The imputation model included the variables \(x\) and \(z\). The intercept and the slope of \(x\) were treated as random factors. We used the mice default settings and monitored convergence as outlined in van Buuren & Groothuis-Oudshoorn (2011). The data sets were then analyzed by the same NB2 model that was used for data imputation. Results were combined using Rubin’s formula (Rubin, 1987) and are displayed in Tables 4.9 and 4.10. The tables display (a) the average parameter estimate across the 200 replications, (b) bias, defined as the difference between the true parameter and the average parameter estimate, (c) %Bias, which is bias divided by the true parameter and multiplied by 100%, and (d) the 95% coverage rate (i.e. the percentage of confidence intervals that include the true parameter).
Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\sigma_{00}\) | \(\sigma_{11}\) | \(\sigma_{01}\) | \(\theta\) |
---|---|---|---|---|---|---|---|
\(Q\) | 1.000 | 0.750 | 0.500 | 0.500 | 0.300 | 0.000 | 2.000 |
\(\widehat{Q}\) | 0.996 | 0.748 | 0.503 | 0.471 | 0.294 | -0.088 | 1.961 |
Bias | 0.004 | 0.002 | -0.003 | 0.029 | 0.006 | 0.088 | 0.039 |
%Bias | 0.400 | 0.267 | -0.600 | 5.800 | 2.000 | Inf | 1.950 |
Coverage Rate | 93.500 | 95.500 | 94.500 | NA | NA | NA | NA |
Parameter | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\sigma_{00}\) | \(\sigma_{11}\) | \(\sigma_{01}\) | \(\theta\) |
---|---|---|---|---|---|---|---|
\(Q\) | 1.000 | 0.750 | 0.500 | 0.500 | 0.300 | 0.000 | 2.000 |
\(\widehat{Q}\) | 0.996 | 0.748 | 0.504 | 0.472 | 0.295 | -0.102 | 1.977 |
Bias | 0.004 | 0.002 | -0.004 | 0.028 | 0.005 | 0.102 | 0.023 |
%Bias | 0.400 | 0.267 | -0.800 | 5.600 | 1.667 | Inf | 1.150 |
Coverage Rate | 94.000 | 95.500 | 93.500 | NA | NA | NA | NA |
As can be seen, the average parameter estimates are close to the defined population quantities, %Bias is below 10%, and coverage is above 90% for all parameters, which means that the imputation function works nicely when, as in this case, model assumptions are met.
4.10 Imputation of zero-inflated two-level count data
Methods "2l.hp"
and "2l.hnb"
impute zero-inflated two-level count data based on a two-level hurdle model. Analogously, methods "2l.zip"
and "2l.zinb"
create imputations based on a two-level zero-inflated Poisson or NB model respectively. The .noint
variants of the functions treat the intercept only as a fixed, but not as a random effect. It may be specified, if the intercept is excluded from the random part of the zero model (.noint.zero
), the random part of the count model (.noint.count
), or from the random part of both models (.noint.both
). For details about all model variants, see help("mice.impute.2l.hnb"
Different sets of predictors can be specified for the zero and count parts of the model. Allowed entries in the predictorMatrix
are listed in Table 4.11:
numeric code | meaning |
---|---|
0 | variable is not included in the imputation model |
1 | variable is included as a fixed effect |
2 | variable is included as a fixed and random effect |
3 | variable is included as a fixed effect (count model only) |
4 | variable is included as a fixed and random effect (count model only) |
5 | variable is included as a fixed effect (zero model only) |
6 | variable is included as a fixed and random effect (zero model only) |
-2 | class variable |
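For example, a hypothetical long-format data set ZDATA with class variable id, in which x1 enters the count model as a fixed and random effect and x2 enters the zero model as a fixed effect, could be specified as:

## hypothetical specification using the zero/count-specific codes
vars <- c( "y", "id", "x1", "x2" )
pred <- matrix( 0, 4, 4, dimnames = list( vars, vars ) )
pred[ "y", "id" ] <- -2   ## class variable
pred[ "y", "x1" ] <-  4   ## fixed and random effect, count model only
pred[ "y", "x2" ] <-  5   ## fixed effect, zero model only
imp <- countimp( ZDATA, method = c( "2l.hnb", "", "", "" ),
                 predictorMatrix = pred, print = FALSE )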
4.10.1 Monte Carlo Simulation
We evaluated these new or updated imputation procedures for zero-inflated two-level data by means of a Monte Carlo simulation (with 200 replications). The design was similar to the one used by Kleinke & Reinecke (2015b): We simulated data of \(g=50\) groups of size \(n_j=100\), \(j\in\{1,\dots,g\}\). The data of the 50 groups were generated separately and then stacked upon each other to create data sets in long format. Each data set consisted of one continuous individual level predictor \(x\) and the overdispersed dependent count variable \(y\). \(x\) influenced both the zero and the count process. The intercept and the slope of \(x\) were treated as random in both the zero and the count part of the model. We set the following population parameters: \(\beta_0=1\) and \(\beta_1=.75\), the fixed intercept and slope terms of the count part of the model, and \(\gamma_0=0\) and \(\gamma_1=.5\), the fixed intercept and slope terms in the zero part of the model. Random effects standard deviations were \(.5\) and \(.3\) for the intercept term (\(\tau_{00}\)) and the slope of \(x\) (\(\tau_{11}\)) respectively. The random effects correlation (\(\tau_{01}\)) was set to \(0\). \(x_{ij}\) were simulated from a normal \(\mathcal{N}(0,1)\) population. Random effects \(u\) were also drawn from normal \(\mathcal{N}(0,\tau)\) populations. Data were generated as follows. We first obtained parameters for the zero part and the count part of the model. For each group \(j\), \(j\in\{1,\dots,g\}\), we simulated the parameters:
\[\begin{align*} \gamma_{0j}&=0+u_{0jz}\\ \gamma_{1j}&=.5+u_{1jz}\\ \beta_{0j}&=1+u_{0jc}\\ \beta_{1j}&=.75+u_{1jc} \end{align*}\]

The probability of belonging to the zero versus the non-zero category was then determined by
\[\begin{equation*} \mu_{ij}=\text{invlogit}(\gamma_{0j}+\gamma_{1j}x_{ij}) \end{equation*}\]

We then drew random uniform numbers \(z_{ij}\sim\mathcal{U}(0,1)\); if \(z_{ij}<\mu_{ij}\), the dependent variable \(y_{ij}\) was set to 0, and to 1 otherwise. In the next step, a non-zero observation in \(y\) was replaced by a random draw from a zero-truncated Poisson distribution with mean \(\exp(\beta_{0j}+\beta_{1j}x_{ij})\) in the Poisson condition, and by a random draw from a zero-truncated NB distribution with mean \(\exp(\beta_{0j}+\beta_{1j}x_{ij})\) and overdispersion parameter \(\theta=2\) in the NB condition.
Then, we introduced missing data in \(y\). Predictor \(x\) remained fully observed. We introduced missing data according to the following rule, where \(p_{ij}\) is the missingness probability and NA denotes a missing value (i.e. “not available”):
\[\begin{align*} p_{ij}&=\text{invlogit}(-1+x_{ij}) \\ v_{ij}&\sim \mathcal{U}(0,1) \\ y_{ij} &= \text{NA, } \text{if } v_{ij}<p_{ij}. \end{align*}\]

This created an average percentage of missing data in \(y\) of 30.4%.
The incomplete variable \(y\) was then imputed \(m=5\) times using both the Bayesian regression and the bootstrap regression versions of the respective function (hurdle Poisson imputation in the Poisson condition, hurdle NB imputation in the negative binomial condition). \(x\) was included as a predictor for both the zero and the count part of the model. The intercept and the slope of \(x\) were treated as random factors (again in both parts of the model). We used the mice default settings and monitored convergence (for details, see Chapter 5). The \(m\) completed data sets were then analyzed by the same model that was used for data imputation. Results were combined using Rubin’s formula (Rubin, 1987).
Results of hurdle Poisson imputation are displayed in Tables 4.12 and 4.13, results of hurdle NB imputation are displayed in Tables 4.14 and 4.15. The tables display the average parameter estimate across the 200 replications, the standard deviation of the estimates across the replications, bias, percent bias and 95% coverage.
\(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate | |
---|---|---|---|---|---|
\(\beta_0\) | 1.00 | 1.001 | -0.001 | -0.100 | 94.924 |
\(\beta_1\) | 0.75 | 0.746 | 0.004 | 0.533 | 93.909 |
\(\gamma_0\) | 0.00 | -0.006 | 0.006 | Inf | 92.893 |
\(\gamma_1\) | 0.50 | 0.492 | 0.008 | 1.600 | 96.447 |
\(\sigma_{00c}\) | 0.50 | 0.491 | 0.009 | 1.800 | NA |
\(\sigma_{11c}\) | 0.30 | 0.283 | 0.017 | 5.667 | NA |
\(\sigma_{01c}\) | 0.00 | -0.008 | 0.008 | Inf | NA |
\(\sigma_{00z}\) | 0.50 | 0.477 | 0.023 | 4.600 | NA |
\(\sigma_{11z}\) | 0.30 | 0.271 | 0.029 | 9.667 | NA |
\(\sigma_{01z}\) | 0.00 | -0.053 | 0.053 | Inf | NA |
\(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate | |
---|---|---|---|---|---|
\(\beta_0\) | 1.00 | 1.011 | -0.011 | -1.100 | 94.416 |
\(\beta_1\) | 0.75 | 0.748 | 0.002 | 0.267 | 96.447 |
\(\gamma_0\) | 0.00 | -0.006 | 0.006 | Inf | 91.878 |
\(\gamma_1\) | 0.50 | 0.495 | 0.005 | 1.000 | 91.878 |
\(\sigma_{00c}\) | 0.50 | 0.446 | 0.054 | 10.800 | NA |
\(\sigma_{11c}\) | 0.30 | 0.261 | 0.039 | 13.000 | NA |
\(\sigma_{01c}\) | 0.00 | -0.127 | 0.127 | Inf | NA |
\(\sigma_{00z}\) | 0.50 | 0.439 | 0.061 | 12.200 | NA |
\(\sigma_{11z}\) | 0.30 | 0.269 | 0.031 | 10.333 | NA |
\(\sigma_{01z}\) | 0.00 | -0.108 | 0.108 | Inf | NA |
\(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate | |
---|---|---|---|---|---|
\(\beta_0\) | 1.00 | 0.993 | 0.007 | 0.700 | 93.939 |
\(\beta_1\) | 0.75 | 0.735 | 0.015 | 2.000 | 92.424 |
\(\gamma_0\) | 0.00 | -0.003 | 0.003 | Inf | 93.939 |
\(\gamma_1\) | 0.50 | 0.497 | 0.003 | 0.600 | 96.970 |
\(\sigma_{00c}\) | 0.50 | 0.480 | 0.020 | 4.000 | NA |
\(\sigma_{11c}\) | 0.30 | 0.276 | 0.024 | 8.000 | NA |
\(\sigma_{01c}\) | 0.00 | -0.059 | 0.059 | Inf | NA |
\(\sigma_{00z}\) | 0.50 | 0.485 | 0.015 | 3.000 | NA |
\(\sigma_{11z}\) | 0.30 | 0.274 | 0.026 | 8.667 | NA |
\(\sigma_{01z}\) | 0.00 | -0.054 | 0.054 | Inf | NA |
\(\theta\) | 2.00 | 2.098 | -0.098 | -4.900 | NA |
\(Q\) | \(\widehat{Q}\) | Bias | %Bias | Coverage Rate | |
---|---|---|---|---|---|
\(\beta_0\) | 1.00 | 0.996 | 0.004 | 0.400 | 92.929 |
\(\beta_1\) | 0.75 | 0.743 | 0.007 | 0.933 | 93.939 |
\(\gamma_0\) | 0.00 | -0.001 | 0.001 | Inf | 93.434 |
\(\gamma_1\) | 0.50 | 0.500 | 0.000 | 0.000 | 95.455 |
\(\sigma_{00c}\) | 0.50 | 0.440 | 0.060 | 12.000 | NA |
\(\sigma_{11c}\) | 0.30 | 0.262 | 0.038 | 12.667 | NA |
\(\sigma_{01c}\) | 0.00 | -0.151 | 0.151 | Inf | NA |
\(\sigma_{00z}\) | 0.50 | 0.444 | 0.056 | 11.200 | NA |
\(\sigma_{11z}\) | 0.30 | 0.268 | 0.032 | 10.667 | NA |
\(\sigma_{01z}\) | 0.00 | -0.127 | 0.127 | Inf | NA |
\(\theta\) | 2.00 | 1.991 | 0.009 | 0.450 | NA |
Estimates are sufficiently accurate. It appears that the bootstrap variants performed slightly worse than the Bayesian variants, especially regarding the random part of the model. This might be caused by the block bootstrap approach implemented here (we re-sample clusters rather than individual cases), which requires a substantial number of clusters. Future research needs to establish under which scenarios it is better to use the Bayesian variants of the functions and when to use the bootstrap variants; little is yet known about level-1 and level-2 sample size requirements.