How to approach a social science research problem when you have data and a couple different ways you could proceed?

tl;dr: Someone asks me a question, I can’t really tell what he’s talking about, so I offer some generic advice.

Joe Hoover writes:

An issue has come up in my subsequent analyses, which use my MrsP estimates to explore the relationship between county-level moral values and the county-level distribution of hate groups, as defined by the SPLC.

Setting aside issues of spatial auto-correlation, control variables, measurement, and all other potential complications, I want to explore the US county-level association between a county mean outcome X and the county-level distribution of rare-event Y (N Y = 0 is about 2800, N Y > 0 is about 250).

My initial analytical plan included two analyses:

1. Model Y as some zero-inflated function of X. I tried this and observed a lot of noise (small effects estimated with low uncertainty).

2. Employ a case-control design that includes all hate group counties + a random sample of counties without hate groups. This design is based on a recent paper that investigated the county-level distribution of hate groups. When I tried this approach, estimation uncertainty decreased and the effects were in the hypothesized direction (how convenient!).

My issue now is that I have two very different sets of results that rely on two very different designs. It seems to me that they address two different questions, but I am not entirely sure what question the second analysis really addresses:

1. If we know X for a given county, does that tell us anything about the expected rate of hate groups in that county? Answer: no.

2. Among counties that…mostly have at least one hate group, does knowing X tell us anything about the expected rate of hate groups in that county? Answer: yes?

Part of my confusion about how to work with these results derives from the complexity of the DGP: there are probably many counties that would be nice places to start a hate group, but maybe…there are no self-motivated bigots there. Or, the bigots there are introverted and don’t like to be in groups, etc.

I guess I’m thinking of these factors as something analogous to epidemiological exposure. For example, perhaps county-level population density increases the risk of contracting a virus at the county level. But, if the virus is rare, estimating a model that includes every county won’t reveal this relationship because most counties were never exposed.

This kind of epidemiological reasoning makes sense to me, but it is outside of my areas of expertise. And, I am also aware that it is probably not a coincidence that the reasoning which justifies the ‘good’ results ‘makes sense’ to me.

Accordingly, I would like to place myself on firmer ground by better understanding the precedents for these different analytical approaches. Specifically, I would like to know if it ever makes sense to use a case-control approach if you have data for the entire world (i.e., in my case, case-control requires throwing out observations, which feels strange). Also, I would like to have a better idea of how to interpret these kinds of results.

My reply:

I’m getting confused on the details here so let me try to step back and answer in the abstract. He’s fitting two completely different models to the same data . . . hmmmm, not quite the same data, more like two takes on the same problem.

Thinking about fundamentals . . . I was taught that, when stuck, we should think about statistical problems as prediction problems, with causal inference corresponding to prediction under various potential outcomes. So that’s what I’d do here. Instead of saying that you want to “explore the relationship between county-level moral values and the county-level distribution of hate groups,” try to define a more precise question (WWJD), then some of the answers will flow.
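
One way to make the prediction framing concrete is to simulate fake data and see what each design actually estimates. Here is a minimal sketch in Python, under assumptions that are mine and not Joe's: the outcome is reduced to a binary indicator (any hate group in the county or not), X is a standard normal county-level predictor, and the true model is a logistic regression with made-up coefficients chosen so that roughly 250 of 3050 counties are "cases." The functions fit_logistic and compare_designs are hypothetical helpers written for this illustration. The sketch fits the same model to all counties and to a case-control subsample (all cases plus an equal-sized random sample of controls), then adjusts the case-control intercept by the log of the control sampling fraction.

```python
import math
import random


def fit_logistic(xs, ys, n_iter=25):
    """Fit P(y=1) = 1 / (1 + exp(-(a + b*x))) by Newton's method; return (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient, intercept
            g1 += (y - p) * x      # gradient, slope
            h00 += w               # entries of X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def compare_designs(n_counties=3050, seed=0, true_a=-2.5, true_b=0.5):
    """Simulate counties (all parameter values are illustrative assumptions),
    then fit the same logistic model to the full data and to a case-control
    subsample."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_counties)]
    ys = [
        1 if rng.random() < 1.0 / (1.0 + math.exp(-(true_a + true_b * x))) else 0
        for x in xs
    ]

    cases = [i for i, y in enumerate(ys) if y == 1]
    controls = [i for i, y in enumerate(ys) if y == 0]
    sampled = rng.sample(controls, len(cases))  # one control per case
    keep = cases + sampled

    a_full, b_full = fit_logistic(xs, ys)
    a_cc, b_cc = fit_logistic([xs[i] for i in keep], [ys[i] for i in keep])

    # Sampling controls at rate f inflates the odds of being a case by 1/f,
    # so the population-scale intercept is a_cc + log(f).
    f = len(sampled) / len(controls)
    return {
        "n_cases": len(cases),
        "slope_full": b_full,
        "slope_cc": b_cc,
        "intercept_full": a_full,
        "intercept_cc": a_cc,
        "intercept_cc_adjusted": a_cc + math.log(f),
    }


if __name__ == "__main__":
    for name, value in compare_designs().items():
        print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

In this simplified setup, the case-control fit targets the same slope as the full-data fit; what changes is the intercept, which can be recovered from the known sampling fraction. So if the two real analyses disagree sharply, the disagreement probably comes from something other than the subsampling itself, such as the different outcome model (zero-inflated counts versus a binary indicator). That is only a sketch of one simulation, not an analysis of the actual data.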