“A small p-value indicates it’s improbable that the results are due to chance alone” – fallacious or not? (more on the ASA p-value doc)

March 12, 2016

(This article was originally published at Statistics – Error Statistics Philosophy, and syndicated at StatsBlogs.)



There’s something about “Principle 2” in the ASA document on p-values that I couldn’t address in my brief commentary, but is worth examining more closely.

2. P-values do not measure (a) the probability that the studied hypothesis is true, or (b) the probability that the data were produced by random chance alone.

(a) is true, but what about (b)? That’s what I’m going to focus on, because I think it is often misunderstood. It was discussed earlier on this blog in relation to the Higgs experiments and deconstructing “the probability the results are ‘statistical flukes’”. So let’s examine:

2(b) P-values do not measure the probability that the data were produced by random chance alone.

We assume here that the p-value is not invalidated by either biasing selection effects or violated statistical model assumptions.

The basis for 2(b) is the denial of a claim we may call claim (1):

Claim (1): A small p-value indicates it’s improbable that the results are due to chance alone as described in H0.

Principle 2(b) asserts that claim (1) is false. Let’s look more closely at the different things that might be meant in teaching or asserting (1). How can we explain the common assertion of claim (1)? Say there is a one-sided test: H0: μ = 0 vs. H1: μ > 0 (or we could have H0: μ ≤ 0).

Explanation #1: A person asserting claim (1) is using an informal notion of probability that is common in English. They mean a small p-value gives grounds (or is evidence) that H1:μ > 0. Under this reading there is no fallacy.

Comment: If H1 has passed a stringent test, a standard principle of inference is to infer H1 is warranted. An informal notion of:

“So probably” H1

is merely qualifying the grounds upon which we assert evidence for H1. When a method’s error probabilities are used to qualify the grounds on which we assert the result of using the method, it is not to assign a posterior probability to a hypothesis. It is important not to confuse informal notions of probability and likelihood in English with technical, formal ones.


Explanation #2: A person asserting claim (1) is interpreting the p-value as a posterior probability of null hypothesis H0 based on a prior probability distribution: p = Pr(H0 |x). Under this reading there is a fallacy.

Comment: Unless the p-value tester has explicitly introduced a prior, this would be a most ungenerous interpretation of what is meant. Given that significance testing is part of a methodology that is directed to provide statistical inference methods whose validity does not depend on a prior probability distribution, it would be implausible to think a teacher of significance tests would mean a Bayesian posterior is warranted. Moreover, since a formal posterior probability assigned to a hypothesis doesn’t signal H1 has been well-tested (as opposed to, say, it’s strongly believed), it seems an odd construal of what a tester means in asserting (1). The informal construal in explanation #1 is far more plausible.

A third explanation further illuminates why some assume this fallacious reading is intended.


Explanation #3: A person asserting claim (1) intends an ordinary error probability. Letting d(X) be the test statistic:

Pr(Test T produces d(X) > d(x); H0) ≤ p.

(Note the definition of the p-value in my comment on the ASA statement.)

Notice: H0 does not say the observed results are due to chance. It is just H0:μ = 0. H0 entails the observed results are due to chance, but that is different. Under this reading there is no fallacy.

Comment: R.A. Fisher was clear that we need not isolated significant results “but a reliable method of procedure” (see my commentary). We may suppose the tester follows Fisher and the test T consists of a pattern of statistically significant results indicating the effect. The probability that we’d be able to generate {d(X) > d(x)} in these experiments, in a world described by H0, is very low (p). Equivalently:

Pr(Test T produces P-value < p; H0) = p

The probability that test T generates such impressively small p-values, under the assumption that they are due to chance alone, is very small: p. Equivalently, a universe adequately described by H0 would produce such impressively small p-values only 100p% of the time. Or yet another way:

Pr(Test T would not regularly produce such statistically significant results; were we in a world where H0 is true) = 1 - p
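The error-probability reading in explanation #3 can be checked by simulation. What follows is only a minimal sketch under assumptions not in the post: normal data with known σ = 1, samples of size n = 25, and the one-sided test of H0: μ = 0 vs. H1: μ > 0. The function names are hypothetical. It generates many data sets in a world where H0 holds and records how often the p-value comes out at or below a chosen threshold p.

```python
import random
from statistics import NormalDist


def one_sided_p_value(sample, sigma=1.0, mu0=0.0):
    """P-value for H0: mu = mu0 vs H1: mu > mu0 (normal data, known sigma)."""
    n = len(sample)
    xbar = sum(sample) / n
    d = (xbar - mu0) / (sigma / n ** 0.5)  # observed test statistic d(x)
    return 1.0 - NormalDist().cdf(d)       # Pr(d(X) > d(x); H0)


def fraction_below(threshold, n=25, trials=20000, mu=0.0, seed=1):
    """Relative frequency of p-values <= threshold when the true mean is mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        if one_sided_p_value(sample) <= threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for p in (0.05, 0.01):
        # Under H0 (mu = 0) the relative frequency should be close to p.
        print(p, fraction_below(p))
```

With μ = 0 the relative frequencies should land near the thresholds themselves, which is what Pr(P-value < p; H0) = p asserts. Passing a positive value for `mu` shows how much more often small p-values arise when H0 is false.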

Severity and the detachment of inferences

Admittedly, the move to inferring evidence of a non-chance discrepancy requires an additional principle of evidence that I have been calling the severity principle (SEV). Perhaps its weakest form applies to a statistical rejection or falsification of the null:

Data x from a test T provide evidence for rejecting H0 (just) to the extent that H0 would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010), a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006, etc.).

The severity principle, put more generally:

Data from a test T (generally understood as a group of individual tests) provide good evidence for inferring H (just) to the extent that H passes severely with x0, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

Here H would be the rather weak claim of some discrepancy, but specific discrepancy sizes can (and should) be evaluated by the same means.
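As an illustration of evaluating specific discrepancy sizes, here is a hedged sketch for the simplest case: a one-sided test about a normal mean with known σ. For the claim μ > μ1, the severity assessment is taken to be the probability of a sample mean no larger than the one observed, computed under μ = μ1. The function name and the numbers are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def severity(xbar, mu1, sigma, n):
    """SEV(mu > mu1) = Pr(X-bar <= observed xbar; mu = mu1).

    Assumes normal data with known sigma and a one-sided (upper) test.
    """
    se = sigma / n ** 0.5
    return NormalDist().cdf((xbar - mu1) / se)


if __name__ == "__main__":
    # Hypothetical observed result: sample mean 0.4, sigma 1, n = 100.
    xbar, sigma, n = 0.4, 1.0, 100
    for mu1 in (0.0, 0.2, 0.4, 0.6):
        print(f"SEV(mu > {mu1}) = {severity(xbar, mu1, sigma, n):.3f}")
```

In this sketch the weak claim of some positive discrepancy (μ > 0) passes with high severity, while claims of larger discrepancies pass with less; for μ1 equal to the observed mean the value is exactly 0.5.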

Conclusion. The only explanation under which claim (1) is a fallacy is the ungenerous explanation #2. Thus, I would restrict principle 2 to 2(a). That said, I’m not claiming 2(b) is the ideal way to construe p-values. In fact, without being explicit about the additional principle that permits linking to the inference (the principle I call severity), it is open to equivocation. I’m just saying it’s typically meant as an ordinary error probability [2].

Souvenir: Don’t merely repeat what you hear about statistical methods (from any side) but, rather, think it through yourself.

Comments are welcome.[1]


Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in Optimality: The Second Erich L. Lehmann Symposium, ed. J. Rojo, Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

Mayo, D. G. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction,” British Journal for the Philosophy of Science 57(2): 323–57.

My comment, “Don’t throw out the error control baby with the bad statistics bathwater,” is #17 under the supplementary materials.

[1] I have this old Monopoly game from my father that contains metal pieces like this top hat. There’s also a racing car, a thimble and more.

[2] The error probabilities come from the sampling distribution and are often said to be “hypothetical”. I see no need to repeat “hypothetical” in alluding to error probabilities.

