It’s not just p=0.048 vs. p=0.052

Peter Dorman points to this post on statistical significance and p-values by Timothy Taylor, editor of the Journal of Economic Perspectives, a highly influential publication of the American Economic Association.

I have some problems with what Taylor writes, but for now I’ll just take it as representing a certain view, the perspective of a thoughtful and well-meaning social scientist who is moderately informed about statistics and wants to be helpful.

Here I’ll just pull out one quote, which points to a common misperception about the problems of p-values, or what might be called “the p-value transformation,” which takes an estimate and standard error and transforms it to a tail-area probability relative to a null hypothesis. Taylor writes:

[G]iven the realities of real-world research, it seems goofy to say that a result with, say, only a 4.8% probability of happening by chance is “significant,” while if the result had a 5.2% probability of happening by chance it is “not significant.” Uncertainty is a continuum, not a black-and-white difference.

First, I don’t know why he conditions on “the realities of real-world research” here. Even in idealized research, the p-value is a random variable, and it would be goofy to draw a sharp line between p = 0.048 and p = 0.052, just as it would be goofy to draw a sharp line between z-scores of 1.94 and 1.98.
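To put rough numbers on "the p-value is a random variable," here is a minimal Python sketch using only the standard library. It assumes, purely for illustration, an idealized experiment whose true effect is exactly 1.96 standard errors from zero, so the observed z-score follows a Normal(1.96, 1) distribution; the variable names are made up for this example.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, sd 1


def two_sided_p(z):
    """Two-sided tail-area probability for a z-score."""
    return 2 * (1 - std_normal.cdf(abs(z)))


# Assumption for illustration: the true effect is 1.96 standard errors,
# so the observed z-score is distributed Normal(1.96, 1).
true_z = 1.96
observed_z = NormalDist(mu=true_z, sigma=1)

# Ends of the central 80% range of the observed z-score,
# and the p-values you would report at each end.
z_low = observed_z.inv_cdf(0.10)
z_high = observed_z.inv_cdf(0.90)
p_at_low = two_sided_p(z_low)
p_at_high = two_sided_p(z_high)

print(f"z ranges from {z_low:.2f} to {z_high:.2f}")
print(f"p ranges from {p_at_low:.3f} down to {p_at_high:.4f}")
```

Under that assumption, a run of the mill replication can report anything from roughly p = 0.5 to roughly p = 0.001, with no flaw in the research at all.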

To formalize this slightly, “goofy” = “not an optimal decision rule or close to an optimal decision rule under any plausibly reasonable utility function.”

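Also, to get technical for a moment, the p-value is not the “probability of happening by chance.” It is the probability of seeing a result at least as extreme as the one observed, if the null hypothesis were true. But we can just chalk that up to a casual writing style.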

My real problem with the above-quoted statement is not the details of wording but rather that I think it represents a mistake in emphasis.

This business of 0.048 vs. 0.052, or 0.04 vs. 0.06, etc.: I hear it a lot as a criticism of p-values, and I think it misses the point. If you want a bright-line rule, you need some threshold. There’s no big difference between 18 years old, and 17 years and 364 days old, but if you’re in the first situation you get to vote, and if you’re in the second situation you don’t. That doesn’t mean that there should be no age limit on voting.

No, my problem with the 0.048 vs. 0.052 thing is that it way, way, way understates the problem.

Yes, there’s no stable difference between p = 0.048 and p = 0.052.

But there’s also no stable difference between p = 0.2 (which is considered non-statistically significant by just about everyone) and p = 0.005 (which is typically considered very strong evidence).

Just look at the z-scores:

> qnorm(1 - c(0.2, 0.005)/2)
[1] 1.28 2.81

The (two-sided) p-values of 0.2 and 0.005 correspond to z-scores of 1.3 and 2.8. That is, a super-duper-significant p = 0.005 is only 1.53 standard errors higher than an ignore-it-pal-there’s-nothing-going-on p = 0.2.

But it’s even worse than that. If these two p-values come from two identical experiments, then the standard error of the difference between the two estimates is sqrt(2) times the standard error of each individual estimate, hence that difference is itself only (2.81 – 1.28)/sqrt(2) = 1.1 standard errors away from zero.
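The arithmetic above can be checked in a few lines. This is a sketch in Python rather than R: `NormalDist().inv_cdf` from the standard library plays the role of `qnorm`, and the helper name `z_from_two_sided_p` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, sd 1


def z_from_two_sided_p(p):
    """z-score whose two-sided tail area equals p (like qnorm(1 - p/2) in R)."""
    return std_normal.inv_cdf(1 - p / 2)


z_weak = z_from_two_sided_p(0.2)      # the "nothing going on" result
z_strong = z_from_two_sided_p(0.005)  # the "very strong evidence" result

# Gap between the two results, in units of one experiment's standard error.
gap = z_strong - z_weak

# The difference of two independent estimates with equal standard errors
# has standard error sqrt(2) times larger, so rescale the gap.
z_of_difference = gap / sqrt(2)

print(f"z-scores: {z_weak:.2f} and {z_strong:.2f}")
print(f"gap: {gap:.2f} standard errors")
print(f"z-score of the difference: {z_of_difference:.2f}")
```

The rescaling by sqrt(2) assumes the two experiments are independent and have the same standard error, which is what "two identical experiments" means here.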

To say it again: it is completely consistent with the null hypothesis to see p-values of 0.2 and 0.005 from two replications of the same damn experiment.
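A quick simulation makes the same point. This is a sketch under stated assumptions: both replications estimate the same true effect with the same standard error, the z-scores are exactly normal, and the true effect of 2 standard errors, the seed, and the function name are arbitrary choices for illustration. The true effect cancels out of the difference, so its value does not change the answer.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_gap_frequency(true_z=2.0, gap=1.53, n_pairs=100_000, seed=1):
    """Fraction of replication pairs whose z-scores differ by at least `gap`."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_pairs):
        # Two replications of the same experiment: same true effect, unit sd.
        z1 = rng.gauss(true_z, 1)
        z2 = rng.gauss(true_z, 1)
        if abs(z1 - z2) >= gap:
            count += 1
    return count / n_pairs


frequency = simulate_gap_frequency()

# Analytic check: z1 - z2 is Normal(0, sqrt(2)), so this is a two-sided tail area.
analytic = 2 * (1 - NormalDist().cdf(1.53 / sqrt(2)))

print(f"simulated: {frequency:.3f}, analytic: {analytic:.3f}")
```

Under these assumptions, more than a quarter of pairs of identical experiments end up at least as far apart as z = 1.28 and z = 2.81.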

So. Yes, it seems goofy to draw a bright line between p = 0.048 and p = 0.052. But it’s also goofy to draw a bright line between p = 0.2 and p = 0.005. There’s a lot less information in these p-values than people seem to think.

So, when we say that the difference between “significant” and “not significant” is not itself statistically significant, “we are not merely making the commonplace observation that any particular threshold is arbitrary—for example, only a small change is required to move an estimate from a 5.1% significance level to 4.9%, thus moving it into statistical significance. Rather, we are pointing out that even large changes in significance levels can correspond to small, nonsignificant changes in the underlying quantities.”