countimp
version 2
Kristian Kleinke and Jost Reinecke
2019-06-27
Abstract
Count data are non-negative integer values and give the frequency of occurrence of a certain event or behavior within a given timespan. Count data are usually not normally distributed but are often skewed and require special analysis and imputation techniques. Yet, most of the currently available multiple imputation packages are very limited with regard to count data. The countimp package provides easy-to-use multiple imputation (MI) procedures for incomplete count data based on either a Bayesian regression approach or a bootstrap regression approach within a chained equations MI framework. Our software extends the functionality of the popular and powerful mice package in R (van Buuren & Groothuis-Oudshoorn, 2011). The current version of countimp supports ordinary count data imputation under the Poisson model, imputation of incomplete overdispersed count data under either the quasi-Poisson or the negative binomial model, imputation of zero-inflated ordinary or overdispersed count data based on a zero-inflated Poisson or negative binomial model, or a hurdle model, and imputation of multilevel count data based on generalized linear mixed effects count models (overdispersion and zero-inflation are supported). Additionally, we provide a predictive mean matching (PMM) variant, based on a two-level model, which might be used when count data are not too heavily skewed (for an evaluation of flat-file count data imputation by PMM, see for example Kleinke, 2017).
Chapter 1 Overview
Count data require special analysis techniques (for a comprehensive overview, see Hilbe, 2011): Ordinary count data can be analyzed by a classical Poisson regression model. Overdispersed count data, i.e., count data that have a larger variance in comparison to their mean are usually modeled by either quasi-Poisson or negative binomial (NB) models (for a distinction of true and apparent overdispersion, see Hilbe, 2011). Zero-inflated count data, i.e., count data that contain more zero counts than would be expected by either the Poisson, quasi-Poisson or NB model, can be analyzed by either zero-inflated Poisson or NB models, or by hurdle Poisson or NB models (see Zeileis, Kleiber, & Jackman, 2008 for an overview). Furthermore, generalized linear mixed effects versions of these models can be fit to cater for a clustered structure of the data.
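The distinction between ordinary and overdispersed counts can be illustrated with a minimal sketch in base R. The simulated data, the parameter values, and all object names are assumptions chosen for illustration; they are not taken from the countimp package.

```r
# Simulate overdispersed counts and compare a Poisson with a quasi-Poisson fit.
set.seed(123)
n  <- 1000
x  <- rnorm(n)
mu <- exp(0.5 + 0.3 * x)

# Negative binomial counts: variance = mu + mu^2 / size, i.e. larger than the mean
y <- rnbinom(n, mu = mu, size = 1)

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# Pearson-based dispersion estimate; values clearly above 1 point to overdispersion
dispersion <- summary(fit_quasi)$dispersion

# Both models give the same point estimates, but the quasi-Poisson
# standard errors are inflated by sqrt(dispersion)
se_pois  <- coef(summary(fit_pois))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]

# Observed share of zeros versus the share expected under the fitted Poisson model
obs_zero <- mean(y == 0)
exp_zero <- mean(dpois(0, fitted(fit_pois)))
```

In this simulation, overdispersion alone already produces more zeros than the Poisson model expects. Whether a zero-inflation or hurdle model is needed in addition depends on whether zeros also exceed what the quasi-Poisson or NB model would predict.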
Missing data methods, and especially multiple imputation procedures, for count data are still very scarce, however. In practice, the missing data problem in count variables is often handled by (a) ignoring that the data are counts and proceeding as if they were continuous, or (b) treating the data as (ordered) categorical. We regard these solutions as quick fixes that may work in some settings, but not as an optimal and general solution for adequately analyzing incomplete count data and obtaining precise and unbiased parameter estimates and standard errors in a wide variety of scenarios. See for example Yu, Burton, & Rivero-Arias (2007) or Kleinke (2017) for situations where imputation models based on the normal model fail. Furthermore, see Chapter 3 for an evaluation of the strategy of treating count data as if they were ordered categorical.
Based on research by Yu et al. (2007) and Kleinke (2017), we recommend using imputation methods whose parametric assumptions fit the data.
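One reason why the parametric assumptions matter can be sketched in base R: draws from a normal imputation model are neither integers nor bounded at zero, whereas draws from a Poisson model are valid counts. The simulated data and all object names below are illustrative assumptions. For brevity, parameter uncertainty is ignored, so this is stochastic regression imputation rather than proper multiple imputation.

```r
set.seed(42)
n <- 500
x <- rnorm(n)
y <- rpois(n, lambda = exp(-0.5 + 0.5 * x))   # skewed counts with many zeros

# Delete about 30% of y completely at random
miss  <- runif(n) < 0.3
y_obs <- ifelse(miss, NA, y)

# Imputation under a normal linear model: prediction plus normal noise
fit_n      <- lm(y_obs ~ x)                    # incomplete cases are dropped
pred       <- predict(fit_n, newdata = data.frame(x = x[miss]))
imp_normal <- pred + rnorm(sum(miss), mean = 0, sd = summary(fit_n)$sigma)

# Imputation under a Poisson model: draw counts from the predicted means
fit_p    <- glm(y_obs ~ x, family = poisson)
lambda   <- predict(fit_p, newdata = data.frame(x = x[miss]), type = "response")
imp_pois <- rpois(sum(miss), lambda)

# Implausible values produced by the normal model
n_negative   <- sum(imp_normal < 0)
n_noninteger <- sum(imp_normal != round(imp_normal))
```

Rounding or truncating the normal-model imputations would remove the implausible values, but it would not restore the skewed shape of the count distribution.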
Currently, the following packages (apart from countimp) can handle missing data based on various count models:
- IVEware, which is available as a SAS add-on or as a standalone version, can create imputations based on an ordinary Poisson model.
- R package mi supports flat-file count data imputation. Zero-inflation models, hurdle models, and multilevel count models are not supported.
- ice for STATA supports count data imputation under a Poisson or NB model.
This document is structured as follows:
- Chapter 2 gives information about missing data and multiple imputation in general, and familiarizes readers with typical count data models.
- In Chapter 3, we illustrate by means of Monte Carlo simulation, why it may not be a good idea to treat counts as if they were either continuous or (ordered) categorical.
- In Chapter 4, we describe package countimp in detail and provide practical examples as well as Monte Carlo simulations to assess both the quality of the newly introduced and the refurbished algorithms from Version 2 of package countimp.
An evaluation of the algorithms from Version 1 of package countimp may furthermore be found in:
What is new in version 2?
- We have included support for both hurdle and zero-inflation models for both flat-file and multilevel data sets (only two-level models are currently supported).
- We have replaced package glmmADMB with glmmTMB to fit the two-level NB, zero-inflation, and hurdle models; in our experience, glmmTMB runs more stably.
- We have adapted our functions to the new mice architecture (countimp version 1 only works with mice version 2.35 or older).
- We have included automatic handling of outliers in the imputed values (argument EV=TRUE). Note that extreme imputations might indicate problems with the imputation method, the model, and/or the algorithm. If extreme imputations are detected, they are replaced by imputations obtained by mice.impute.midastouch(), a predictive mean matching (PMM) variant described in Gaffert, Meinfelder, & Bosch (2016). PMM is usually a good all-round imputation method that works in many scenarios (van Buuren, 2012).
- We have included a two-level predictive mean matching algorithm. PMM can be counted among the hot deck or k-nearest neighbour procedures and imputes an actual observed value. Based on predictions by a normal two-level linear mixed effects model, an observed case (donor) is matched to an incomplete case and the actual observed value replaces the missing one. By imputing only observed values, PMM is able (up to a certain degree) to preserve the original distribution of the variable at hand and can buffer mild to moderate violations of model assumptions. On the downside, PMM cannot extrapolate beyond the observed range of the remaining cases and might not be a good method in situations where the distribution of the observed cases and the assumed distribution of the unobserved cases are highly distinct. A description of the method, as well as results of a first Monte Carlo simulation, may be obtained from https://www.kkleinke.com/files/slides/2016-09-dgps.pdf.
- All imputation functions are available as Bayesian regression and bootstrap regression variants (see Chapter 2.1.4). These are two different methods to introduce between-imputation variability. The flat-file bootstrap functions re-sample individual observations, whereas the two-level functions re-sample clusters. Note that re-sampling clusters requires a substantial level-2 sample size. If the level-2 sample size is small, it is advisable to use the Bayesian variant instead. For further information, see the Monte Carlo simulation in Chapter 4.10.
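The flat-file bootstrap regression idea from the last item can be sketched in base R as follows. This is a simplified illustration under a Poisson model with a single predictor, not the code used in countimp; the helper impute_pois_boot() and the simulated data are hypothetical.

```r
set.seed(2019)
n <- 300
x <- rnorm(n)
y <- rpois(n, exp(0.3 + 0.4 * x))
y[sample(n, 60)] <- NA                        # 60 values missing at random

# Hypothetical helper (not a countimp function): one bootstrap regression
# imputation of an incomplete count variable under a Poisson model
impute_pois_boot <- function(y, x) {
  obs <- which(!is.na(y))
  mis <- which(is.na(y))
  # Step 1: re-sample the observed cases with replacement
  idx  <- sample(obs, length(obs), replace = TRUE)
  boot <- data.frame(y = y[idx], x = x[idx])
  # Step 2: fit the count model to the bootstrap sample; the re-sampling
  # is what introduces between-imputation variability in the parameters
  fit <- glm(y ~ x, family = poisson, data = boot)
  # Step 3: predict the means of the incomplete cases and draw counts
  lambda <- predict(fit, newdata = data.frame(x = x[mis]), type = "response")
  y[mis] <- rpois(length(mis), lambda)
  y
}

m <- 5
imputed <- replicate(m, impute_pois_boot(y, x))   # n x m matrix of completed data
```

Each column of imputed is one completed version of y. A two-level analogue would re-sample whole clusters in step 1 instead of individual cases, which is why it needs a sufficiently large number of level-2 units.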