A Python program for multivariate missing-data imputation that works on large datasets!?

January 11, 2018

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Alex Stenlake and Ranjit Lall write about a program they wrote for imputing missing data:

Strategies for analyzing missing data have become increasingly sophisticated in recent years, most notably with the growing popularity of the best-practice technique of multiple imputation. However, existing algorithms for implementing multiple imputation suffer from limited computational efficiency, scalability, and capacity to exploit complex interactions among large numbers of variables. These shortcomings render them poorly suited to the emerging era of “Big Data” in the social and natural sciences.

Drawing on new advances in machine learning, we have developed an easy-to-use Python program – MIDAS (Multiple Imputation with Denoising Autoencoders) – that leverages principles of Bayesian nonparametrics to deliver a fast, scalable, and high-performance implementation of multiple imputation. MIDAS employs a class of unsupervised neural networks known as denoising autoencoders, which are capable of producing complex, fine-grained reconstructions of partially corrupted inputs. To enhance their accuracy and speed while preserving their complexity, these networks leverage the recently developed technique of Monte Carlo dropout, changing their output from a frequentist point estimate into the approximate posterior of a Gaussian process. Preliminary tests indicate that, in addition to successfully handling large datasets that cause existing multiple imputation algorithms to fail, MIDAS generates substantially more accurate and precise imputed values than such algorithms in ordinary statistical settings.
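To make the Monte Carlo dropout idea concrete, here is a minimal sketch of the prediction-time sampling step only. It is not the MIDAS code: the network is a toy one-hidden-layer autoencoder with random, untrained weights, and every name in it (`make_weights`, `forward`, `mc_dropout_impute`) is made up for illustration. The point is just that leaving dropout switched on while predicting turns a single reconstruction into a set of differing draws, each of which can serve as one imputed dataset.

```python
import math
import random

def make_weights(n_in, n_hidden, rng):
    """Random, untrained weights; a real model would learn these from data."""
    scale = 1.0 / math.sqrt(n_in)
    w_enc = [[rng.gauss(0, scale) for _ in range(n_hidden)] for _ in range(n_in)]
    w_dec = [[rng.gauss(0, scale) for _ in range(n_in)] for _ in range(n_hidden)]
    return w_enc, w_dec

def forward(x, w_enc, w_dec, drop_rate, rng):
    """One stochastic pass: dropout stays on, so repeated calls give different outputs."""
    keep = 1.0 - drop_rate
    hidden = []
    for j in range(len(w_dec)):
        pre = sum(x[i] * w_enc[i][j] for i in range(len(x)))
        act = math.tanh(pre)
        # Inverted dropout: zero the unit with probability drop_rate, rescale survivors.
        hidden.append(act / keep if rng.random() < keep else 0.0)
    return [sum(hidden[j] * w_dec[j][i] for j in range(len(hidden)))
            for i in range(len(x))]

def mc_dropout_impute(row, w_enc, w_dec, n_draws, drop_rate, rng):
    """Return n_draws completed copies of row; None marks a missing entry."""
    x = [0.0 if v is None else v for v in row]  # missing inputs are fed in as zeros
    draws = []
    for _ in range(n_draws):
        recon = forward(x, w_enc, w_dec, drop_rate, rng)
        # Observed values are kept; only missing cells take the network output.
        draws.append([recon[i] if v is None else v for i, v in enumerate(row)])
    return draws

rng = random.Random(0)
w_enc, w_dec = make_weights(n_in=4, n_hidden=16, rng=rng)
row = [0.3, None, -1.2, 0.8]  # toy record with one missing value
draws = mc_dropout_impute(row, w_enc, w_dec, n_draws=5, drop_rate=0.5, rng=rng)
for d in draws:
    print([round(v, 3) for v in d])
```

In a trained denoising autoencoder the weights would come from learning to reconstruct deliberately corrupted inputs; only the sampling mechanism is shown here.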

They continue:

Please keep in mind we’re writing for a political science/IPE audience, where listwise deletion is still common practice. The “best-practice” part should be fairly evident among your readership…in fact, it’s probably just considered “how to build a model”, rather than a separate step.

And here are some details:

Our method is “black box”-y, using neural networks and denoising autoencoders to approximate a Gaussian process. We call it MIDAS – Multiple Imputation with Denoising AutoencoderS – because you have to have a snappy name. Using a neural network has drawbacks – it’s a non-interpretable system which can’t give additional insight into the data generation process. On the upside, it means you can point it at truly enormous datasets and get accurate imputations in relatively short (by MCMC standards) time. For example, we ran the entire CCES (66k x ~2k categories) as one of our test cases in about an hour. We’ve also built in an overimputation method for checking model complexity and the ability of the model to reconstruct known values. It doesn’t return the full distribution of missing values; instead it gives point estimates of error for deliberately removed values, which provide a good estimate of accuracy and let you avoid overtraining through early stopping. A rough sanity check, if you will. This is an attempt to map onto the same terrain covered in your blog post.
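The overimputation check described above can be sketched in a few lines. This is an illustration under stated assumptions, not MIDAS's own interface: the imputer is a deliberately crude column-mean stand-in, the data are simulated, and the function names (`column_mean_impute`, `overimputation_rmse`) are hypothetical. The idea is to hide some values that were actually observed, impute them, and score the imputations against the truth.

```python
import math
import random

def column_mean_impute(data):
    """Stand-in imputer (NOT MIDAS): fill each None with its column's observed mean."""
    n_cols = len(data[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        # Guard against a column with nothing observed.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in data]

def overimputation_rmse(data, impute, holdout_frac, rng):
    """Hide a fraction of the observed cells, impute, and score against the hidden truth."""
    observed = [(i, j) for i, row in enumerate(data)
                for j, v in enumerate(row) if v is not None]
    n_hide = max(1, int(holdout_frac * len(observed)))
    hidden = rng.sample(observed, n_hide)
    masked = [list(row) for row in data]  # copy, so the original data stay untouched
    for i, j in hidden:
        masked[i][j] = None
    completed = impute(masked)
    sq_err = [(completed[i][j] - data[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(sq_err) / len(sq_err))

# Simulated data: 200 rows, 3 columns, roughly 10% of cells missing.
rng = random.Random(1)
data = []
for _ in range(200):
    x = rng.gauss(0, 1)
    row = [x, 2 * x + rng.gauss(0, 0.5), rng.gauss(5, 2)]
    data.append([None if rng.random() < 0.1 else v for v in row])

rmse = overimputation_rmse(data, column_mean_impute, holdout_frac=0.1, rng=rng)
print(f"held-out RMSE: {rmse:.3f}")
```

For early stopping, one would presumably recompute a held-out error like this as training proceeds and stop once it starts to rise.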

Due to aggressive regularisation and application of deep learning techniques, it’s also resistant to overfitting in overspecified models. A second test loosely follows the method outlined in Honaker and King 2012, taking the World Development Indicators, subsetting out a 31-year period for 6 African nations, and then lagging all complete indicators a year either side. We then remove a single value of GDP, run a complete imputation model and compare S=200 draws of the approximate posterior to the true value. In other words, the data feed is a 186 x ~3.6k matrix, suffering hugely from both collinearity and irrelevant input, and it still yields quite accurate results. It should be overfitting hell, but it’s not. We’d make a joke about the Bayesian LASSO, but I’m actually not sure. Right now we think it’s a combination of data augmentation and sparsity driven by input noise. Gal’s PhD thesis is the basis for this algorithm, and this conception of a sparsity-inducing prior could just be a logical extension of his idea. Bayesian deep learning is all pretty experimental right now.
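As a rough sketch of the comparison step in that test, the snippet below summarizes a set of S=200 draws for one held-out cell against its known value. The draws and the "true" value are simulated placeholders, not results from MIDAS or from the World Development Indicators, and `summarize_draws` is a hypothetical helper.

```python
import random
import statistics

def summarize_draws(draws, true_value):
    """Compare draws for one held-out cell with its known true value."""
    # quantiles(n=40) returns 39 cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(draws, n=40)
    lower, upper = cuts[0], cuts[-1]
    mean = statistics.fmean(draws)
    return {
        "mean": mean,
        "sd": statistics.stdev(draws),
        "error": mean - true_value,           # bias of the average imputation
        "interval": (lower, upper),           # central 95% interval of the draws
        "covers_truth": lower <= true_value <= upper,
    }

rng = random.Random(42)
true_gdp = 10.0                                      # made-up held-out value
draws = [rng.gauss(10.2, 0.5) for _ in range(200)]   # placeholder for S=200 model draws

summary = summarize_draws(draws, true_gdp)
print(summary)
```

Repeating this over many held-out cells would give an empirical coverage rate, which is one simple way to judge whether the spread of the draws is about right.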

We’ve got a GitHub alpha release of MIDAS up, but there’s still a long way to go before it gets close to Stan’s level of flexibility. Right now, it’s a fire-and-forget algorithm designed to be simple and fast. Let’s be frank – it’s not a replacement for conventional model-based imputation strategies. For bespoke models, we’d still trust a fully specified generative model, which can incorporate explicit information about the missingness-generating mechanism, over our own. Our aim is to build a better solution than listwise deletion for the average scholar/data scientist, one that can reliably handle the sorts of nonlinear patterns found in real datasets. Looking at the internet, most users – particularly data scientists – aren’t statisticians who have the time or energy to go the full-Bayes route.

Missing data imputation is tough. You want to include lots of predictors and a flexible model, so regularization is necessary, but it’s only recently that we—statisticians and users of statistics—have become comfortable with regularization. Our mi package, for example, uses only very weak regularization, and I think we’ll need to start from scratch when thinking about imputation. As background, here are 3 papers we wrote on generic missing data imputation: http://www.stat.columbia.edu/~gelman/research/published/MI_manuscript_RR.pdf and http://www.stat.columbia.edu/~gelman/research/published/mipaper.pdf and http://www.stat.columbia.edu/~gelman/research/published/paper77notblind.pdf .

I encourage interested readers to check out Stenlake and Lall’s program and see how it works. Or, if people have any thoughts on the method, feel free to share these thoughts in comments. I’ve not looked at the program or the documentation myself; I’m just sharing it here because the description above sounds reasonable.

The post A Python program for multivariate missing-data imputation that works on large datasets!? appeared first on Statistical Modeling, Causal Inference, and Social Science.

