Forking paths plus lack of theory = No reason to believe any of this.

December 29, 2017

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

[image of a cat with a fork]

Kevin Lewis points us to this paper which begins:

We use a regression discontinuity design to estimate the causal effect of election to political office on natural lifespan. In contrast to previous findings of shortened lifespan among US presidents and other heads of state, we find that US governors and other political office holders live over one year longer than losers of close elections. The positive effects of election appear in the mid-1800s, and grow notably larger when we restrict the sample to later years. We also analyze heterogeneity in exposure to stress, the proposed mechanism in the previous literature. We find no evidence of a role for stress in explaining differences in life expectancy. Those who win by large margins have shorter life expectancy than either close winners or losers, a fact which may explain previous findings.

All things are possible but . . . Jesus, what a bunch of forking paths. Forking paths plus lack of theory = No reason to believe any of this.

Just to clarify: Yes, there’s some theory in the paper, kinda, but it’s the sort of theory that Jeremy Freese describes as “more vampirical than empirical—unable to be killed by mere evidence” because any of the theoretical explanations could go in either direction (in this case, being elected could be argued to increase or decrease lifespan, indeed one could easily make arguments for the effect being positive in some scenarios and negative in others): the theory makes no meaningful empirical predictions.

I’m not saying that theory is needed to do social science research: There’s a lot of value in purely descriptive work. But if you want to take this work as purely descriptive, you have to deal with the problems of selection bias and forking paths inherent in reporting demographic patterns that are the statistical equivalent of noise-mining statements such as, “The Dodgers won 9 out of their last 13 night games played on artificial turf.”
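To put a rough number on the baseball line, here is a minimal sketch. It assumes a coin-flip team (win probability 0.5, which is my assumption, not a fact about the Dodgers) and treats the 20 "slices" of the schedule (night games, turf, road trips, and so on) as independent, which is also a simplification. The function names are made up for illustration.

```python
import math

def tail_probability(wins, games, p_win=0.5):
    """Chance of at least `wins` wins in `games` games for a team that
    wins each game independently with probability `p_win` (assumed)."""
    return sum(
        math.comb(games, k) * p_win**k * (1 - p_win) ** (games - k)
        for k in range(wins, games + 1)
    )

def chance_of_at_least_one(p_single, n_slices):
    """Chance that at least one of `n_slices` independent slices of the
    schedule shows a record this lopsided."""
    return 1 - (1 - p_single) ** n_slices

# 9 or more wins out of 13 for a coin-flip team: 1093/8192, about 0.13.
p_streak = tail_probability(9, 13)

# Hypothetical: the announcer can pick among 20 ways of slicing the schedule.
p_any = chance_of_at_least_one(p_streak, 20)

print(f"P(at least 9 of 13 by chance): {p_streak:.3f}")
print(f"P(some slice out of 20 looks this good): {p_any:.3f}")
```

A record like this is unremarkable for a single pre-chosen slice, and it becomes close to guaranteed once you are free to choose the slice after looking at the data.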

On the plus side, the above-linked article includes graphs indicating how weak and internally contradictory the evidence is. So if you go through the entire paper and look at the graphs, you should get a sense that there’s not much going on here.

But if you just read the abstract and you don’t know about all the problems with such studies, you could really get fooled.

The thing that people just don’t get is just how easy it is to get “p less than .01” using uncontrolled comparisons. Uri Simonsohn explains in his post, “P-hacked Hypotheses Are Deceivingly Robust,” along with a story of odd numbers and the horoscope.
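A small simulation can make the forking-paths point concrete. This is a sketch under stated assumptions, not a reconstruction of the paper's analysis: the outcome is pure noise, winning is a coin flip, and the "researcher" is free to report the overall comparison or the comparison within any of 20 subgroups defined by 10 made-up binary covariates. All names and sample sizes are hypothetical.

```python
import math
import random
import statistics

def two_sided_p(a, b):
    """Two-sided p-value for a difference in means, using a normal
    approximation (slightly optimistic in small samples)."""
    if len(a) < 2 or len(b) < 2:
        return 1.0
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0:
        return 1.0
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(n_sims=500, n=200, n_covariates=10, alpha=0.01, seed=1):
    """Return (rate for one pre-specified test, rate for the best of many)."""
    rng = random.Random(seed)
    single_hits = 0
    forking_hits = 0
    for _ in range(n_sims):
        won = [rng.random() < 0.5 for _ in range(n)]
        lifespan = [rng.gauss(0, 1) for _ in range(n)]  # no true effect at all
        covariates = [[rng.random() < 0.5 for _ in range(n)]
                      for _ in range(n_covariates)]

        def p_for(mask):
            # Compare winners and losers within the subgroup selected by mask.
            winners = [y for y, w, m in zip(lifespan, won, mask) if m and w]
            losers = [y for y, w, m in zip(lifespan, won, mask) if m and not w]
            return two_sided_p(winners, losers)

        p_all = p_for([True] * n)
        p_values = [p_all]
        for cov in covariates:
            p_values.append(p_for(cov))                    # subgroup: covariate true
            p_values.append(p_for([not c for c in cov]))   # subgroup: covariate false

        single_hits += int(p_all < alpha)
        forking_hits += int(min(p_values) < alpha)
    return single_hits / n_sims, forking_hits / n_sims

single_rate, forking_rate = simulate()
print(f"share with p < .01, one pre-specified test: {single_rate:.3f}")
print(f"share with p < .01, best of 21 comparisons: {forking_rate:.3f}")
```

The first rate should sit near the nominal 1%, and the second should be many times larger, even though nothing is going on in the simulated data. A real analysis has far more forks than this (time windows, bandwidths, outcome definitions), and the researcher does not need to try them all consciously for the problem to arise.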

It’s our fault

Statistics educators, including myself, have to take much of the blame for this sad state of affairs.

We go around sending the message that it’s possible to get solid causal inference from experimental or observational data, as long as you have a large enough sample size and a good identification strategy.

People such as the authors of the above article then take us at our word, gather large datasets, find identification strategies, and declare victory. The thing we didn’t say in our textbooks was that this approach doesn’t work so well in the absence of clean data and strong theory. In the example discussed above, the data are noisy—lots and lots of things affect lifespan in much more important ways than whether you win or lose an election—and, as already noted, the theory is weak and doesn’t even predict a direction of the effect.

The issue is not that “p less than .01” is useless—there are times when “p less than .01” represents strong evidence—but rather that this p-value says very little on its own.
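Here is a rough sketch of why a small p-value says so little when the data are noisy. It assumes a true effect of 0.2 years and a standard error of 1 year. Both numbers are invented for illustration and are not estimates from the paper. The code keeps only the simulated estimates that reach p less than .01 and asks how misleading they are.

```python
import random
import statistics

def significance_filter(true_effect=0.2, std_error=1.0, z_cut=2.576,
                        n_sims=100_000, seed=2):
    """Simulate many noisy estimates of a small true effect and summarize
    the ones that pass a two-sided p < .01 threshold (|z| > 2.576)."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) / std_error > z_cut:
            significant.append(estimate)
    magnitudes = [abs(e) for e in significant]
    return {
        # How often a study like this reaches p < .01 at all.
        "share_significant": len(significant) / n_sims,
        # Average size of a significant estimate relative to the truth.
        "exaggeration": statistics.fmean(magnitudes) / true_effect,
        # Share of significant estimates pointing the wrong way.
        "wrong_sign": sum(e < 0 for e in significant) / len(significant),
    }

result = significance_filter()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed numbers, any estimate that clears the threshold has to be at least 2.576 / 0.2, or roughly 13, times the true effect, and a nontrivial share of the significant estimates have the wrong sign. The p-value is doing what it was designed to do; it just cannot tell you that the estimate it is attached to is any good.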

I suspect it would be hard to convince the authors of the above paper that this is all a problem, as they’ve already invested a lot of work in this project. But I hope that future researchers will realize that, without clean data and strong theory, this sort of approach to scientific discovery doesn’t quite work as advertised.

And, again, I’m not saying the claim in that published paper is false. What I’m saying is I have no idea: it’s a claim coming out of nowhere for which no good evidence has been supplied. The claim’s opposite could just as well be true. Or, to put it more carefully, we can expect any effect to be highly variable, situation-dependent, and positive in some settings and negative in others.

P.S. I suspect this sort of criticism is demoralizing for many researchers—not just those involved in the particular article discussed above—because forking paths and weak theory are ubiquitous in social research. All I can say is: yeah, there’s a lot of work out there that won’t stand up. That’s what the replication crisis is all about! Indeed, the above paper could usefully be thought of as an example of a failed replication: if you read carefully, you’ll see that the result that was found was weak, noisy, and in the opposite direction of what was expected. In short, a failed replication. But instead of presenting it as a failed replication, the authors presented it as a discovery. After all, that’s what academic researchers are trained to do: turn those lemons into lemonade!

Anyway, yeah, sorry for the bad news, but that’s the way it goes. The point of the replication crisis is that you have to start expecting that huge swaths of the social science literature won’t replicate. Just cos a claim is published, don’t think that implies there’s any serious evidence behind it. “Hey, it could be true” + “p less than 0.05” + published in a journal (or, in this case, posted on the web) is not enough.
