Hey! Here’s what to do when you have two or more surveys on the same population!

This problem comes up a lot: We have multiple surveys of the same population and we want a single inference. The usual approach, applied carefully by news organizations such as Real Clear Politics and Five Thirty Eight, and applied sloppily by various attention-seeking pundits every two or four years, is “poll aggregation”: you take the estimate from each poll separately, if necessary correct these estimates for bias, then combine them with some sort of weighted average.

But this procedure is inefficient and can lead to overconfidence (see discussion here, or just remember the 2016 election).

A better approach is to pool all the data from all the surveys together. A survey response is a survey response! Then when you fit your model, include indicators for the individual surveys (varying intercepts, maybe varying slopes too), and include that uncertainty in your inferences. Best of both worlds: you get the efficiency from counting each survey response equally, and you get an appropriate accounting of uncertainty from the multiple surveys.

OK, you can’t always do this: To do it, you need all the raw data from the surveys. But it’s what you should be doing, and if you can’t, you should recognize what you’re missing.

The post Hey! Here’s what to do when you have two or more surveys on the same population! appeared first on Statistical Modeling, Causal Inference, and Social Science.