(This article was originally published at Statistics – Error Statistics Philosophy, and syndicated at StatsBlogs.)

There’s something about “Principle 2” in the ASA document on p-values that I couldn’t address in my brief commentary, but is worth examining more closely.

*2. P-values do not measure (a) the probability that the studied hypothesis is true, or (b) the probability that the data were produced by random chance alone.*

(a) is true, but what about (b)? That’s what I’m going to focus on, because I think it is often misunderstood. It was discussed earlier on this blog in relation to the **Higgs experiments** and deconstructing “the probability the results are ‘statistical flukes’”. So let’s examine:

*2(b) P-values do not measure the probability that the data were produced by random chance alone.*

*We assume here that the p-value is not invalidated by either biasing selection effects or violated statistical model assumptions.*

The basis for **2(b)** is the denial of a claim we may call **claim (1):**

**Claim (1): A small p-value indicates it’s improbable that the results are due to chance alone as described in *H*_{0}.**

Principle 2(b) asserts that claim (1) is false. Let’s look more closely at the different things that might be meant in teaching or asserting (1). **How can we explain the common assertion of claim (1)?** Say there is a one-sided test:

*H*_{0}: μ = 0 vs. *H*_{1}: μ > 0 (or we could have *H*_{0}: μ ≤ 0).
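For concreteness, the p-value in such a one-sided test is an upper-tail probability computed under *H*_{0}. A minimal sketch, assuming a Normal model with known σ = 1 (a detail not fixed in the post, added here only for illustration):

```python
from math import erf, sqrt

def one_sided_p_value(xbar, n, sigma=1.0, mu0=0.0):
    """P-value for testing H0: mu = mu0 against H1: mu > mu0,
    assuming X ~ N(mu, sigma^2) with sigma known."""
    # observed test statistic d(x0): standardized sample mean
    z = (xbar - mu0) / (sigma / sqrt(n))
    # p = Pr(Z >= z; H0), the upper tail of the standard normal,
    # via the identity Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

# e.g. a sample mean of 0.4 from n = 25 observations gives z = 2.0
print(round(one_sided_p_value(0.4, 25), 4))  # → 0.0228
```

The p-value is a probability computed *under* the null model, not a probability *of* the null model; the code makes that visible, since *H*_{0} enters only by fixing mu0.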

**Explanation #1:** A person asserting claim (1) is using an informal notion of probability that is common in English. They mean a small p-value gives grounds (or is evidence) that *H*_{1}: μ > 0. **Under this reading there is no fallacy.**

*Comment*: If *H*_{1} has passed a stringent test, a standard principle of inference is to infer that *H*_{1} is warranted. An informal notion of “so probably *H*_{1}” is *merely qualifying the grounds upon which we assert evidence for* *H*_{1}. When a method’s error probabilities are used to qualify the grounds on which we assert the result of using the method, it is not to assign a posterior probability to a hypothesis. It is important not to confuse informal notions of probability and likelihood in English with technical, formal ones.

**Explanation #2:** A person asserting claim (1) is interpreting the p-value as a posterior probability of null hypothesis *H*_{0} based on a prior probability distribution: p = Pr(*H*_{0}|*x*). **Under this reading there is a fallacy.**

*Comment*: Unless the p-value tester has explicitly introduced a prior, this would be a most ungenerous interpretation of what is meant. Given that significance testing is part of a methodology directed at providing statistical inference methods whose validity does not depend on a prior probability distribution, it would be implausible to think a teacher of significance tests would mean that a Bayesian posterior is warranted. Moreover, since a formal posterior probability assigned to a hypothesis doesn’t signal that *H*_{1} has been well-tested (as opposed to, say, strongly believed), it seems an odd construal of what a tester means in asserting (1). The informal construal in explanation #1 is far more plausible.

A third explanation further illuminates why some assume this fallacious reading is intended.

**Explanation #3:** A person asserting claim (1) intends an ordinary error probability. Letting *d*(**X**) be the test statistic:

Pr(Test T produces *d*(**X**) ≥ *d*(**x**_{0}); *H*_{0}) ≤ p.

(Note the definition of the p-value in my comment on the ASA statement.)

**Notice: *H*_{0} does not say the observed results are due to chance.** It is just *H*_{0}: μ = 0. *H*_{0} *entails* that the observed results are due to chance, but that is different.

**Under this reading there is no fallacy.**

*Comment*: R.A. Fisher was clear that we need not isolated significant results “but a reliable method of procedure” (see my commentary). We may suppose the tester follows Fisher, and that test T consists of a pattern of statistically significant results indicating the effect. The probability that we’d be able to generate {*d*(**X**) ≥ *d*(**x**_{0})} in these experiments, in a world described by *H*_{0}, is very low (p). Equivalently:

Pr(Test T produces P-value < p; *H*_{0}) = p.

The probability that test T generates such impressively small p-values under the assumption they are due to chance alone is very small, p. Equivalently, a universe adequately described by *H*_{0} would produce such impressively small p-values only 100p% of the time. Or yet another way:
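The relation Pr(P-value < p; *H*_{0}) = p can be checked by simulation: in a world where *H*_{0}: μ = 0 holds, the p-value is uniformly distributed, so small p-values occur at exactly their advertised rate. A hypothetical sketch, again assuming the illustrative Normal model with known σ = 1:

```python
import random
from math import erf, sqrt

def p_value(xbar, n, sigma=1.0):
    # Pr(d(X) >= d(x0); H0) for H0: mu = 0 vs H1: mu > 0
    z = xbar / (sigma / sqrt(n))
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

random.seed(1)
n, trials, alpha = 25, 20_000, 0.05

# simulate many experiments in a world adequately described by H0
hits = sum(
    p_value(sum(random.gauss(0, 1) for _ in range(n)) / n, n) < alpha
    for _ in range(trials)
)
print(hits / trials)  # close to alpha = 0.05
```

The observed frequency of “impressively small” p-values matches p itself: an error probability of the procedure, not a posterior probability of *H*_{0}.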

Pr(Test T would *not* regularly produce such statistically significant results; were we in a world where *H*_{0}) = 1 − p.

*Severity and the detachment of inferences*

Admittedly, the move to inferring evidence of a non-chance discrepancy requires an additional principle of evidence that I have been calling the severity principle (SEV). Perhaps the weakest form applies to a statistical rejection or falsification of the null:

Data **x**_{0} from a test T provide evidence for rejecting *H*_{0} (just) to the extent that *H*_{0} would (very probably) have survived, were it a reasonably adequate description of the process generating the data (with respect to the question).

It is also captured by a general frequentist principle of evidence (FEV) (Mayo and Cox 2010), a variant on the general idea of severity (SEV) (EGEK 1996, Mayo and Spanos 2006, etc.).

The *severity principle*, put more generally:

Data from a test T (generally understood as a group of individual tests) provide good evidence for inferring H (just) to the extent that H passes severely with **x**_{0}, i.e., to the extent that H would (very probably) not have survived the test so well were H false.

*Here H would be the rather weak claim of some discrepancy, but specific discrepancy sizes can (and should) be evaluated by the same means.*

*Conclusion*. The only explanation under which claim (1) is a fallacy is the ungenerous explanation #2. Thus, I would restrict principle 2 to 2(a). That said, I’m not claiming 2(b) is the ideal way to construe p-values. In fact, without being explicit about the additional principle that permits linking to the inference (the principle I call severity), it is open to equivocation. I’m just saying it’s typically meant as an ordinary error probability [2].

**Souvenir**: Don’t merely repeat what you hear about statistical methods (from any side) but, rather, think it through yourself.

Comments are welcome.[1]

Mayo, D. G. and Cox, D. R. (2006), “Frequentist Statistics as a Theory of Inductive Inference,” in *Optimality: The Second Erich L. Lehmann Symposium*, ed. J. Rojo, Lecture Notes–Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77–97.

Mayo, D. G. and Spanos, A. (2006), “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction,” *British Journal for the Philosophy of Science* 57(2): 323–57.

My comment, “Don’t throw out the error control baby with the bad statistics bathwater,” is #17 under the supplementary materials.

[1] I have this old Monopoly game from my father that contains metal pieces like this top hat. There’s also a racing car, a thimble and more.

[2] The error probabilities come from the sampling distribution and are often said to be “hypothetical”. I see no need to repeat “hypothetical” in alluding to error probabilities.
