Another Regression Discontinuity Disaster, and what we can learn from it

As the above image from Diana Senechal illustrates, a lot can happen near a discontinuity boundary.

Here’s a more disturbing picture, which comes from a recent research article, “The Bright Side of Unionization: The Case of Stock Price Crash Risk,” by Jeong-Bon Kim, Eliza Xia Zhang, and Kai Zhong:

which I learned about from the following email:

On Jun 18, 2019, at 11:29 AM, ** wrote:

Hi Professor Gelman,

This paper is making the rounds on social media:

Look at the RDD in Figure 3 [the above two graphs]. It strikes me as pretty weak and reminds me a lot of your earlier posts on the China air pollution paper. Might be worth blogging about?

If you do, please don’t cite this email or my email address in your blog post, as I would prefer to remain anonymous.

Thank you,

This anonymity thing comes up pretty often—it seems that there’s a lot of fear regarding the consequences of criticizing published research.

Anyway, yeah, this is bad news. The discontinuity at the boundary looks big and negative, in large part because the fitted curves have a large positive slope in that region, which in turn seems to be driven by data at the far edges of the graph, which are essentially irrelevant to the causal question being asked.
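To see how flexible fitted curves can manufacture a jump out of nothing, here's a minimal simulation, on made-up data that has nothing to do with the paper's actual dataset: the true relation is smooth with zero discontinuity, yet fitting separate high-degree polynomials on each side of the cutoff yields jump estimates that bounce around far more than a simple linear fit does.

```python
import numpy as np

rng = np.random.default_rng(0)

def rd_jump(x, y, degree):
    """Estimated discontinuity at x = 0: the difference between the two
    fitted values at the cutoff, from separate polynomial fits on each side."""
    left = x < 0
    right = ~left
    # np.polyfit returns coefficients highest-degree first; evaluating
    # the fit at 0 gives the fitted value at the cutoff.
    fit_left = np.polyval(np.polyfit(x[left], y[left], degree), 0.0)
    fit_right = np.polyval(np.polyfit(x[right], y[right], degree), 0.0)
    return fit_right - fit_left

# 687 matches the sample size reported in the paper; everything else here
# is invented for illustration.
n, sims = 687, 1000
jumps_lin, jumps_quartic = [], []
for _ in range(sims):
    x = rng.uniform(-1, 1, n)
    y = 0.5 * x + rng.normal(0, 1, n)  # smooth truth, ZERO jump at x = 0
    jumps_lin.append(rd_jump(x, y, 1))
    jumps_quartic.append(rd_jump(x, y, 4))

print("sd of linear jump estimates: ", np.std(jumps_lin))
print("sd of quartic jump estimates:", np.std(jumps_quartic))
```

The exact numbers don't matter; the point is that the degree-4 fits produce a much wider spread of estimated jumps at the cutoff, even though the true jump is exactly zero. A polynomial fit's variance is largest at the boundary of its data, which is exactly where the RD estimate is evaluated.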

It’s indeed reminiscent of this notorious example from a few years ago:

[image: the discontinuity graph from that earlier example]

And, as before, it’s stunning not just that the researchers made this mistake—after all, statistics is hard, and we all make mistakes—but that they could put a graph like the ones above directly into their paper and not realize the problem.

This is not a case of the chef burning the steak and burying it in a thick sauce. It’s more like the chef taking the burnt slab of meat and serving it with pride—not noticing its inedibility because . . . the recipe was faithfully applied!

What happened?

Bertrand Russell has this great quote, “This is one of those views which are so absurd that only very learned men could possibly adopt them.” On the other hand, there’s this from George Orwell: “To see what is in front of one’s nose needs a constant struggle.”

The point is that the above graphs are obviously ridiculous—but all these researchers and journal editors didn’t see the problem. They’d been trained to think that if they followed certain statistical methods blindly, all would be fine. It’s that all-too-common attitude that causal identification plus statistical significance equals discovery and truth. Not realizing that both causal identification and statistical significance rely on lots of assumptions.

The estimates above are bad. You can call them noisy (because the discontinuity of interest is perturbed by this super-noisy curvy function) or biased (because in this particular dataset the curves are inflating the discontinuity by a lot). At a technical level, these estimates come with overconfident confidence intervals (see this paper with Zelizer and this one with Imbens), but you hardly need all that theory and simulation to see the problem: just look at the above graphs without any ideological lenses.

Ideology—statistical ideology—is important here, I think. Researchers have this idea that regression discontinuity gives rigorous causal inference, and that statistical significance gives effective certainty, and that the rest is commentary. These attitudes are ridiculous, but we have to recognize that they’re there.

The authors do present some caveats, but these are a bit weak for my taste:

Finally, we acknowledge the limitations of the RDD and alert readers to be cautious when generalizing our inferences in different contexts. The RDD exploits the local variation in unionization generated by union elections and compares crash risk between the two distinct samples of firms with the close-win and close-loss elections. Thus, it can have strong local validity, but weak external validity. In other words, the negative impact of unionization on crash risk may be only applicable to firms with vote shares falling in the close vicinity of the threshold. It should be noted, however, that in the presence of heterogeneous treatment effect, the RDD estimate can be interpreted as a weighted average treatment effect across all individuals, where the weights are proportional to the ex ante likelihood that the realized assignment variable will be near the threshold (Lee and Lemieux 2010). We therefore reiterate the point that “it remains the case that the treatment effect estimated using a RD design is averaged over a larger population than one would have anticipated from a purely ‘cutoff’ interpretation” (Lee and Lemieux 2010, 298).

I agree that generalization is a problem, but I’m not at all convinced that what they’ve found applies even to their data. Again, a big part of their negative discontinuity estimate is coming from that steep up-sloping curve which seems like nothing more than an artifact.

How to better analyze these data?

To start with, I’d like to see a scatterplot. According to the descriptive statistics there are 687 data points, so the above graph must be showing binned averages or something like that. Show me the data!
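As a sketch of what I mean, again on made-up data since the paper's data aren't reproduced here: the dots in a typical RD figure are binned averages computed roughly as below, and plotting them alongside the raw points makes clear how much variation the bins hide.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the 687 firms: x is vote share relative to
# the 50% threshold, y is some crash-risk measure. Invented for illustration.
n = 687
x = rng.uniform(-0.5, 0.5, n)
y = 0.2 * x + rng.normal(0, 1, n)

def binned_means(x, y, n_bins=20):
    """Average y within equal-width bins of x, the usual way the dots
    in an RD plot are constructed. Returns bin centers and bin means."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.array([y[idx == b].mean() if np.any(idx == b) else np.nan
                      for b in range(n_bins)])
    return centers, means

centers, means = binned_means(x, y)
# Plot both the raw (x, y) points and (centers, means): the binned dots
# are far less variable than the underlying data, which is exactly why
# a plot showing only the dots can look misleadingly clean.
```

The binned averages smooth away most of the firm-level noise, so a figure showing only them (plus a fitted curve) can give a false sense of how much signal is in the data.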

Next, accept that this is an observational study, comparing companies that did or did not have unions. These two groups of companies differ in many ways, one of which is the voter share in the union election. But there are other differences too. Throwing them all in a regression will not necessarily do a good job of adjusting for all these variables.
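One basic check along these lines, requiring no curve fitting at all, is to compare pre-election covariates for firms just above and just below the threshold. Here's a sketch on hypothetical data; the variable names and bandwidth are invented, and I'm not claiming the paper did or didn't report such checks.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical firm-level data: vote share relative to the 50% cutoff,
# plus pre-election covariates (stand-ins for things like firm size and
# leverage) that may differ between close winners and close losers.
n = 687
vote = rng.uniform(-0.5, 0.5, n)
size = rng.normal(0, 1, n)       # invented covariate
leverage = rng.normal(0, 1, n)   # invented covariate

def balance_check(running, covariate, bandwidth=0.1):
    """Standardized difference in covariate means just above vs. just
    below the cutoff. A large value suggests the two groups of firms
    differ in ways a simple RD comparison won't adjust for."""
    near = np.abs(running) < bandwidth
    above = near & (running >= 0)
    below = near & (running < 0)
    diff = covariate[above].mean() - covariate[below].mean()
    pooled_sd = covariate[near].std()
    return diff / pooled_sd

print("size balance:    ", balance_check(vote, size))
print("leverage balance:", balance_check(vote, leverage))
```

In this simulated example the covariates are balanced by construction; with real firms, imbalances near the threshold would be a warning that the close-win and close-loss companies differ in ways beyond the union vote itself.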

The other thing I don't really follow is their measures of stock price crash risk. These seem like pretty convoluted definitions; there must be lots of ways to measure this, at many time scales. This is a problem with the black-box approach to causal inference, but I'm not sure how this aspect of the problem could be handled better. The trouble is that stock prices are notoriously noisy, so it's not like you could have a direct model of unionization affecting the prices—even beyond the obvious point that unionization, or the lack thereof, will have different effects in different companies. But if you go black-box and look at some measure of stock prices as an outcome, then the results could be sensitive to how and when you look at them. These particular measurement issues are not our first concern here—as the above graphs demonstrate, the estimation procedure being used here is a disaster—but if you want to study the problem more seriously, I'm not at all clear that looking at stock prices in this way will be helpful.

Larger lessons

Again, I’d draw a more general lesson from this episode, and others like it, that when doing science we should be aware of our ideologies. We’ve seen so many high-profile research articles in the past few years that have had such clear and serious flaws. On one hand it’s a social failure: not enough eyes on each article, nobody noticing or pointing out the obvious problems.

But, again, I also blame the reliance on canned research methods. And I blame pseudo-rigor, the idea that some researchers have that their proposed approach is automatically correct. And, yes, I’ve seen that attitude among Bayesians too. Rigor and proof and guarantee are fine, and they all come with assumptions. If you want the rigor, you need to take on the assumptions. Can’t have one without the other.

Finally, in case anyone thinks I'm being too harsh on an unpublished paper: If the topic is important enough to talk about, it's important enough to criticize. I'm happy to get criticisms of my papers, published and unpublished. Better to have mistakes noticed sooner rather than later.