# Regularized Prediction and Poststratification (the generalization of Mister P)

May 19, 2018
(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

This came up in comments recently so I thought I’d clarify the point.

Mister P is MRP, multilevel regression and poststratification. The idea goes like this:

1. You want to adjust for differences between sample and population. Let y be your outcome of interest and X be the demographic and geographic variables you’d like to adjust for. Assume X is discrete so you can define a set of poststratification cells, j=1,…,J (for example, if you’re poststratifying on 4 age categories, 5 education categories, 4 ethnicity categories, and 50 states, then J=4*5*4*50=4000, and the cells might go from 18-29-year-old, no-high-school-education whites in Alabama to over-65-year-old, post-graduate-education Latinos in Wyoming). Each cell j has a population N_j from the census.

2. You fit a regression model y | X to data, to get a predicted average response for each person in the population, conditional on their demographic and geographic variables. You’re thus estimating theta_j, the average response in cell j, for j=1,…,J. The *regression* part of MRP comes in because you need to make these predictions.

3. Given point estimates of theta, you can estimate the population average as sum_j (N_j*theta_j) / sum_j (N_j). Or you can estimate various intermediate-level averages (for example, state-level results) using partial sums over the relevant subsets of the poststratification cells.

4. In the Bayesian version (for example, using Stan), you get a matrix of posterior simulations, with each row of the matrix representing one simulation draw of the vector theta; this then propagates to uncertainties in any poststrat averages.

5. The *multilevel* part of MRP comes because you want to adjust for lots of cells j in your poststrat, so you’ll need to estimate lots of parameters theta_j in your regression, and multilevel regression is one way to get stable estimates with good predictive accuracy.
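To make the poststratification arithmetic in steps 1, 3, and 4 concrete, here is a minimal sketch in plain Python. Everything numeric in it is made up for illustration: the category labels, the cell populations `N`, the cell estimates `theta`, and the fake "posterior draws" are assumptions, not census data or output from a fitted model. The helper `poststratify` is a hypothetical name, not part of any library.

```python
from itertools import product
import random

# Hypothetical category labels; only the counts (4, 5, 4, 50) come from the text.
ages = ["18-29", "30-44", "45-64", "65+"]
educations = ["no HS", "HS", "some college", "college", "post-grad"]
ethnicities = ["white", "black", "latino", "other"]
states = [f"state_{s:02d}" for s in range(50)]

# One poststratification cell per combination of categories.
cells = list(product(ages, educations, ethnicities, states))
J = len(cells)  # 4 * 5 * 4 * 50

rng = random.Random(0)
# Made-up cell populations N_j and cell-level estimates theta_j (illustration only).
N = [rng.randint(100, 10_000) for _ in cells]
theta = [rng.random() for _ in cells]


def poststratify(theta, N, keep=None):
    """Return sum_j(N_j * theta_j) / sum_j(N_j) over the cells where keep[j] is true.

    With keep=None, all cells are used (the population average).
    """
    idx = [j for j in range(len(N)) if keep is None or keep[j]]
    return sum(N[j] * theta[j] for j in idx) / sum(N[j] for j in idx)


# Step 3: population average, and a state-level average as a partial sum.
national = poststratify(theta, N)
in_state_00 = [cell[3] == "state_00" for cell in cells]
state_00 = poststratify(theta, N, in_state_00)

# Step 4: with a matrix of simulation draws (one row per draw of the vector theta),
# poststratify each row to get draws of the population average.
# These rows are noise added to theta, standing in for real posterior simulations.
draws = [
    [min(1.0, max(0.0, t + rng.gauss(0.0, 0.05))) for t in theta]
    for _ in range(100)
]
national_draws = [poststratify(row, N) for row in draws]
```

The spread of `national_draws` is what carries the uncertainty in theta through to the poststratified average; the same row-by-row calculation works for any subset of cells.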

OK, fine. The point is: poststratification is key. It’s all about (a) adjusting for many ways in which your sample isn’t representative of the population, and (b) getting estimates for population subgroups of interest.

But it’s not crucial that the theta_j’s be estimated using multilevel regression. More generally, we can use any *regularized prediction* method that gives reasonable and stable estimates while including a potentially large number of predictors.
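As a toy illustration of swapping in a different regularized predictor, the sketch below shrinks each cell's sample mean toward the overall sample mean and then poststratifies. This particular estimator is an assumption chosen for simplicity; it is not a multilevel regression and not a method specified in this post. The function name `shrunken_cell_means`, the `prior_strength` parameter, and all the numbers are hypothetical.

```python
def shrunken_cell_means(cell_sums, cell_counts, prior_strength=5.0):
    """Estimate theta_j as (sum_j + k * ybar) / (n_j + k).

    sum_j is the sum of y in cell j, n_j is the sample size in cell j,
    ybar is the overall sample mean, and k = prior_strength acts like
    k pseudo-observations at ybar. Small cells are pulled strongly toward
    ybar; large cells stay close to their own sample mean.
    """
    ybar = sum(cell_sums) / sum(cell_counts)
    return [
        (s + prior_strength * ybar) / (n + prior_strength)
        for s, n in zip(cell_sums, cell_counts)
    ]


# Made-up sample: three cells, one well sampled, one tiny, one empty.
cell_counts = [50, 2, 0]
cell_sums = [30.0, 2.0, 0.0]  # sum of y within each cell
N = [1000, 3000, 2000]        # made-up population counts per cell

theta_hat = shrunken_cell_means(cell_sums, cell_counts)

# Poststratify the regularized estimates: sum_j(N_j * theta_j) / sum_j(N_j).
estimate = sum(n * t for n, t in zip(N, theta_hat)) / sum(N)
```

Note that the empty cell still gets an estimate (the overall mean), and the two-person cell is not taken at face value. Raw cell means would be undefined or very noisy here, which is why some form of regularization is needed before poststratifying over many cells.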

Hence, regularized prediction and poststratification. RPP. It doesn’t sound quite as good as MRP but it’s the more general idea.
