(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)
Juli writes:
I’m helping a professor out with an analysis, and I was hoping that you might be able to point me to some relevant literature… She has two studies that have been completed already (so we can’t go back to the planning stage in terms of sampling, unfortunately). Both studies are based around the population of adults in LA who attended LA public high schools at some point, so that is the same for both studies. Study #1 uses random digit dialing, so I consider that one to be SRS. Study #2, however, is a convenience sample in which all participants were involved with one of eight community-based organizations (CBOs).
Of course, both studies can be analyzed independently, but she was hoping for there to be some way to combine/compare the two studies. Specifically, I am working on looking at the civic engagement of the adults in both studies. In study #1, this means looking at factors such as involvement in student government. In study #2, this means looking at involvement in CBOs…but they were all involved in those.
I know I can’t blindly combine the two studies. I also know that not having a control group (i.e., not in CBOs) in study #2 is a problem, as is the convenience sampling, but I can’t change those things. I was trying to see if I could somehow use study #1 (or part of it – participants who look similar based on a variety of factors) to act as the control group for study #2 and do some sort of matching, but I’m not sure that’s okay. Then I was trying to see if I could combine the studies and act as though they are different strata, one with SRS and one with quota sampling (I think – per Lohr’s book, chapter on stratified sampling). But I’m still not sure if it’s okay to compare them that way.
I know that overall, generalizability is going to be nearly impossible here. But it would be really nice to come up with a creative way to make this work. I have a sneaking suspicion that this might be useful for others – which then made me wonder if this has been tackled before. Any thoughts?
My reply:
It’s funny this comes up, because we were just having a discussion on the blog with a student at UCLA who was asking about the use of hierarchical models for causal inference, combining different data sources.
My generic advice is to set up a regression model controlling for as many background variables as possible. Then, within each poststratification cell, it's possible that the two groups can be considered equivalent to a natural experiment in which one group is involved with a CBO and the other isn't. Since you can't control for everything, the next step is to include in the model an unobserved variable representing unknown differences (that is, selection effects). How exactly to do this, though, I don't know. On this subject, I'm all talk and no action.
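To make the first step concrete, here is a minimal sketch in Python of a regression that pools the two studies and adjusts for background variables. Everything in it is an assumption for illustration: the data are simulated, the covariates (age and years of education), the outcome, the effect size, and the function name `adjusted_group_difference` are all made up, and a plain least-squares fit stands in for whatever model would actually suit the real data. It covers only the observed-covariate adjustment, not the unobserved selection term.

```python
import numpy as np

def adjusted_group_difference(y, in_cbo, covariates):
    """Least-squares fit of y on an intercept, a CBO indicator, and
    background covariates; returns the coefficient on the indicator."""
    n = len(y)
    X = np.column_stack([np.ones(n), in_cbo, covariates])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

# Simulated stand-ins for the two studies (hypothetical sizes and variables).
rng = np.random.default_rng(0)
n1, n2 = 500, 300  # study 1 (random digit dialing), study 2 (CBO sample)
age = np.concatenate([rng.normal(40, 12, n1), rng.normal(35, 10, n2)])
educ = np.concatenate([rng.normal(13, 2, n1), rng.normal(14, 2, n2)])
in_cbo = np.concatenate([np.zeros(n1), np.ones(n2)])

# Made-up outcome: a civic engagement score with a known group difference.
true_effect = 0.5
y = (1.0 + true_effect * in_cbo + 0.02 * age + 0.1 * educ
     + rng.normal(0, 1, n1 + n2))

# Raw difference in means versus the covariate-adjusted difference.
raw = y[in_cbo == 1].mean() - y[in_cbo == 0].mean()
adj = adjusted_group_difference(y, in_cbo, np.column_stack([age, educ]))
print(f"raw difference: {raw:.2f}, adjusted difference: {adj:.2f}")
```

In this simulation the two samples differ in age and education, so the raw comparison mixes the group difference with those background differences, while the adjusted coefficient accounts for them. The adjustment only works here because the simulated outcome depends on nothing but the observed covariates; with real data, selection on unobserved variables would remain.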
My more constructive suggestion would be to talk with Jennifer or, since you’re at UCLA, with Sander Greenland in the epidemiology department. This sort of thing is right up his alley.
