Conditioning on post-treatment variables when you expect self-selection

Sadish Dhakal writes:

I am struggling with the problem of conditioning on post-treatment variables. I was hoping you could provide some guidance. Note that I have repeated cross sections, NOT panel data. Here is the problem simplified:

There are two programs. A policy introduced some changes in one of the programs, which I call the treatment group (T). People can select into T. In fact, there’s strong evidence that T programs become more popular in the period after the policy change (P). But this is entirely consistent with my hypothesis. My hypothesis is that high-quality people select into the program. I expect that people selecting into T will have better outcomes (Y) because they are of higher quality. Consider the specification (avoiding indices):

Y = b0 + b1 T + b2 P + b3 (T × P) + e    (i)

I expect that b3 will be positive (which it is). Again, my hypothesis is that b3 is positive only because higher-quality people select into T after the policy change. Let me reframe the problem slightly (and please correct me if I’m reframing it wrongly). If I could observe and control for quality Q, I could write the error term as e = Q + u, and b3 in the specification below would be zero.

Y = b0 + b1 T + b2 P + b3 (T × P) + Q + u    (ii)

My thesis is not that the policy “caused” better outcomes, but that it induced selection. How worried should I be about conditioning on T? How should I go about avoiding bogus conclusions?

My reply:

Best would be if you could simply observe Q and include it in the model. If that’s not possible, get some estimate of Q: some pre-treatment measure of quality. Label that measure as z. Then you can fit a measurement-error model:

Y = b0 + b1 T + b2 P + b3 (T × P) + Q + u    (ii.a)
z = Q + error (ii.b)

Your inferences will be sensitive to your model for the error in z: the larger that error, the less z tells you about Q, and the further you are from your ideal model (ii) above. But that’s life. You gotta make assumptions somewhere.
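
Here is a minimal simulation sketch of that sensitivity, in Python with NumPy. Everything in it is an assumption made for illustration and none of it comes from the data in question: the selection rule (a logistic curve in Q after the policy, a flat 30% before), the coefficient values, the sample size, and the function names simulate and b3_hat. The true b3 is set to zero, so any positive estimate is pure selection. Note also that the code just plugs z in for Q in a least-squares regression and lets the coefficient on the control be estimated freely; that is a cruder thing than fitting the measurement-error model (ii.a)-(ii.b), but it is enough to show how the answer degrades as the error in z grows.

```python
import numpy as np

def simulate(n=200_000, seed=0):
    """Simulate repeated cross sections where the policy changes who selects into T."""
    rng = np.random.default_rng(seed)
    P = rng.integers(0, 2, size=n)   # 0 = before the policy change, 1 = after
    Q = rng.normal(size=n)           # latent quality (unobserved in practice)
    # Hypothetical selection rule: before the policy, choosing T is unrelated
    # to Q; afterwards, higher-Q people are more likely to choose T.
    p_select = np.where(P == 1, 1.0 / (1.0 + np.exp(-1.5 * Q)), 0.3)
    T = rng.binomial(1, p_select)
    u = rng.normal(size=n)
    # True b3 is zero: the policy affects Y only through who selects in.
    Y = 1.0 + 0.5 * T + 0.2 * P + 0.0 * T * P + Q + u
    return Y, T, P, Q

def b3_hat(Y, T, P, control=None):
    """Least-squares estimate of the T x P coefficient, optionally adjusting for a control."""
    cols = [np.ones_like(Y), T, P, T * P]
    if control is not None:
        cols.append(control)
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef[3]

Y, T, P, Q = simulate()
noise_rng = np.random.default_rng(1)

print("model (i), no adjustment:", round(b3_hat(Y, T, P), 3))
print("model (ii), Q observed:  ", round(b3_hat(Y, T, P, Q), 3))
for sd in (0.5, 1.0, 2.0):
    z = Q + noise_rng.normal(scale=sd, size=Q.size)   # noisy proxy, as in (ii.b)
    print(f"adjusting for z, error sd {sd}:", round(b3_hat(Y, T, P, z), 3))
```

Under these assumptions, the unadjusted estimate of b3 should come out clearly positive, the estimate with Q observed should be near zero, and the estimates using z should sit in between, creeping back toward the unadjusted number as the error in z increases.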

To put it another way, you should proceed on two fronts:
1. Data,
2. Modeling.
Get better data on the selection process, and model it too. Both data and model are important.