---
title: "Using Multiple Hot Deck Data Sets for Inference"
author: "Skyler Cranmer, Jeff Gill, Natalie Jackson, Andreas Murr, David A. Armstrong II, Simon Heuberger"
output:
  rmarkdown::html_vignette:
    citation_package: natbib
    number_sections: true
bibliography: hot.deck.bib
biblio-style: apsr
date: "`r format(Sys.time(), '%B %d, %Y')`"
vignette: >
  %\VignetteIndexEntry{Using Multiple Hot Deck Data Sets for Inference}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, echo=FALSE, include=FALSE}
options(useFancyQuotes=FALSE, width=100)
```

This vignette walks through several methods for generating pooled model results that account for both sampling variability and across-imputation variability. The `hot.deck` package does not provide inference functions of its own, so we show how the data it generates can be analyzed with `MIcombine` from the `mitools` package applied to a list of model objects, and with `lm.mids` (and, analogously, `glm.mids`) from the `mice` package.

# Generating Imputations

The data we use come from @Poeetal1999 and deal with democracy and state repression. First, we call the `hot.deck` routine on the dataset.

```{r echo=T, include=T}
library(hot.deck)
data(isq99)
out <- hot.deck(isq99, sdCutoff=3, IDvars = c("IDORIGIN", "YEAR"))
```

The output shows that 45 observations still have fewer than 5 donors. Using a different method or increasing the `sdCutoff` parameter further may alleviate the problem. To see the frequency distribution of the number of donors, you can look at:

```{r numdonors, echo=T, include=T}
numdonors <- sapply(out$donors, length)
numdonors <- ifelse(numdonors > 5, 6, numdonors)
numdonors <- factor(numdonors, levels=1:6, labels=c(1:5, ">5"))
table(numdonors)
```

Before running a model, three variables have to be created from the existing ones. Generally, if variables are deterministic functions of other variables (e.g., transformations, lags, etc.), it is advisable to impute the constituent variables and then perform the calculations afterward. Here, we need to lag the `AI` variable and create percentage-change variables for both population and per-capita GNP. First, we create lags of `AI`, `PCGNP`, and `LPOP` using a small helper function.

```{r tscslag, echo=T, include=T}
tscslag <- function(dat, x, id, time){
  # unique identifier for each observation (unit.time)
  obs <- apply(dat[, c(id, time)], 1, paste, collapse=".")
  # identifier of the same unit in the previous time period
  tm1 <- dat[[time]] - 1
  lagobs <- apply(cbind(dat[[id]], tm1), 1, paste, collapse=".")
  # value of x from the previous period (NA where no previous period exists)
  dat[match(lagobs, obs), x]
}
for(i in 1:length(out$data)){
  out$data[[i]]$lagAI <- tscslag(out$data[[i]], "AI", "IDORIGIN", "YEAR")
  out$data[[i]]$lagPCGNP <- tscslag(out$data[[i]], "PCGNP", "IDORIGIN", "YEAR")
  out$data[[i]]$lagLPOP <- tscslag(out$data[[i]], "LPOP", "IDORIGIN", "YEAR")
}
```

Now, we can use the lagged values of `PCGNP` and `LPOP` to create the percentage-change variables:

```{r pcgchange, echo=T, include=T}
for(i in 1:length(out$data)){
  out$data[[i]]$pctchgPCGNP <- with(out$data[[i]], (PCGNP-lagPCGNP)/lagPCGNP)
  out$data[[i]]$pctchgLPOP <- with(out$data[[i]], (LPOP-lagLPOP)/lagLPOP)
}
```
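As a quick sanity check before fitting any models, you can inspect the derived variables in one of the imputed data sets. This is a minimal sketch (not evaluated here); the lagged and percentage-change values should be `NA` in each country's first observed year, since no prior period exists.

```{r checkvars, echo=TRUE, eval=FALSE}
# summarize the constructed variables in the first imputed data set;
# NAs correspond to country-years with no preceding year in the data
summary(out$data[[1]][, c("lagAI", "pctchgPCGNP", "pctchgLPOP")])
```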
# Combining Results

## Using MIcombine

One option is the `MIcombine` function from the `mitools` package. Here, you fit the model to each imputed data set, collect the fitted models in a list, and `MIcombine` pools the results across imputations.

```{r mods, echo=T, include=T}
# convert the hot.deck output to an Amelia-style object with an imputations list
out <- hd2amelia(out)
# fit the same model on each imputed data set and store the results in a list
results <- list()
for(i in 1:length(out$imputations)){
  results[[i]] <- lm(AI ~ lagAI + pctchgPCGNP + PCGNP + pctchgLPOP + LPOP +
    MIL2 + LEFT + BRIT + POLRT + CWARCOW + IWARCOW2, data=out$imputations[[i]])
}
# pool the coefficients and standard errors across imputations
summary(mitools::MIcombine(results))
```

## Using mids

A second approach is to convert the data object returned by `hot.deck` to an object of class `mids`, which can then be analyzed with the `mice` tools. The conversion can be done with the `datalist2mids` function from the `miceadds` package.

```{r conv, echo=T, include=T}
# convert the list of imputed data sets to a mids object
out.mids <- miceadds::datalist2mids(out$imputations)
# fit the model on each imputed data set and pool the results
s <- summary(mice::pool(mice::lm.mids(AI ~ lagAI + pctchgPCGNP + PCGNP + pctchgLPOP +
  LPOP + MIL2 + LEFT + BRIT + POLRT + CWARCOW + IWARCOW2, data=out.mids)))
print(s, digits=4)
```

# References