(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Joachim Krueger writes:

As many of us rely (in part) on p values when trying to make sense of the data, I am sending a link to a paper Patrick Heck and I published in Frontiers in Psychology. The goal of this work is not to fan the flames of the already overheated debate, but to provide some estimates about what p can and cannot do. Statistical inference will always require experience and good judgment regardless of which school of thought (Bayesian, frequentist, or other) we are leaning to.

I have three reactions.

**1.** I don’t think there’s any “overheated debate” about the p-value; it’s a method that has big problems and is part of the larger problem that is null hypothesis significance testing (see my article, The problems with p-values are not just with p-values); also p-values are widely misunderstood (see also here).

From a Bayesian point of view, p-values are most cleanly interpreted in the context of uniform prior distributions—but the setting of uniform priors, where there’s nothing special about zero, is the scenario where p-values are generally irrelevant.
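To make the uniform-prior connection concrete, here is a sketch under the simplest assumptions I can think of: a single estimate $y$ with known standard error $s$, a normal likelihood, and a flat prior on the effect $\theta$. The symbols $y$, $s$, $\theta$, and $\Phi$ (the standard normal cdf) are notation introduced for this illustration only.

```latex
y \mid \theta \sim \mathrm{N}(\theta, s^2), \qquad p(\theta) \propto 1
\;\;\Longrightarrow\;\;
\theta \mid y \sim \mathrm{N}(y, s^2),
\qquad
\Pr(\theta \le 0 \mid y) = \Phi\!\left(-\frac{y}{s}\right).
```

For $y > 0$, the right-hand side is the one-sided p-value for testing $\theta = 0$. So under a flat prior the p-value can be read as a posterior probability that the effect has the opposite sign, but that same prior treats zero as just another point on the line.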

So I don’t have much use for p-values. They still get used in practice—a lot—so there’s room for lots more articles explaining them to users, but I’m kinda tired of the topic.

**2.** I disagree with Krueger’s statement that “statistical inference will always require experience and good judgment.” For better or worse, lots of statistical inference is done using default methods by people with poor judgment and little if any relevant experience. Too bad, maybe, but that’s how it is.

Does statistical inference require experience and good judgment? No more than driving a car requires experience and good judgment. All you need is gas in the tank and the key in the ignition and you’re ready to go. The roads have all been paved and anyone can drive on them.

**3.** In their article, Krueger and Heck write, “Finding p = 0.055 after having found p = 0.045 does not mean that a bold substantive claim has been refuted (Gelman and Stern, 2006).” Actually, our point was much bigger than that. Everybody knows that 0.05 is arbitrary and there’s no real difference between 0.045 and 0.055. Our point was that apparent huge differences in p-values are not actually stable (“statistically significant”). For example, a p-value of 0.20 is considered to be useless (indeed, it’s often taken, erroneously, as evidence of no effect), and a p-value of 0.01 is considered to be strong evidence. But a two-sided p-value of 0.20 corresponds to a z-score of 1.28, and a two-sided p-value of 0.01 corresponds to a z-score of 2.58. The difference is 1.3, which is not close to statistically significant. (The difference between two independent estimates, each with standard error 1, has a standard error of sqrt(2); thus a difference in z-scores of 1.3 is actually less than 1 standard error away from zero!) So I fear that, by comparing 0.055 to 0.045, they are minimizing the main point of our paper.
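The arithmetic in that example is easy to check. Here is a minimal sketch using only the Python standard library; it assumes two-sided p-values from a normal reference distribution and two independent estimates, each with standard error 1. The function and variable names are made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_from_two_sided_p(p: float) -> float:
    """Return the |z| whose two-sided standard normal tail area equals p."""
    return NormalDist().inv_cdf(1 - p / 2)


# z-scores for the "useless" and the "strong" p-values
z_weak = z_from_two_sided_p(0.20)
z_strong = z_from_two_sided_p(0.01)

# Difference between the two z-scores
diff = z_strong - z_weak

# Assumption: independent estimates, each with standard error 1,
# so the standard error of the difference is sqrt(1 + 1)
se_diff = sqrt(2)

# How many standard errors the difference is from zero
z_diff = diff / se_diff

print(f"z for p = 0.20: {z_weak:.2f}")
print(f"z for p = 0.01: {z_strong:.2f}")
print(f"difference: {diff:.2f}, se: {se_diff:.2f}, ratio: {z_diff:.2f}")
```

The last ratio comes out below 1, which is the point: the gap between "no evidence" and "strong evidence" is itself well within noise.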

More generally I think that all the positive aspects of the p-value they discuss in their paper would be even more positive if researchers were to use the z-score and not ever bother with the misleading transformation into the so-called p-value. I’d much rather see people reporting z-scores of 1.5 or 2 or 2.5 than reporting p-values of 0.13, 0.05, and 0.01.
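For readers who want to go back and forth between the two scales, here is a small sketch of the conversion, again assuming two-sided p-values under a standard normal reference distribution. The helper name `two_sided_p` is hypothetical, not from any package.

```python
from statistics import NormalDist


def two_sided_p(z: float) -> float:
    """Two-sided tail area of the standard normal beyond |z|."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The z-scores mentioned above and their approximate p-values
for z in (1.5, 2.0, 2.5):
    print(f"z = {z:.1f}  ->  p = {two_sided_p(z):.3f}")
```

Equal steps of 0.5 on the z scale turn into very unequal-looking steps on the p scale, which is part of why the transformation misleads.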

The post Ride a Crooked Mile appeared first on Statistical Modeling, Causal Inference, and Social Science.
