National Academies of Science: Please Correct Your Definitions of P-values

[Image: Mayo banging head]

If you were on a committee to highlight issues surrounding P-values and replication, what’s the first definition you would check? Yes, exactly. Apparently, when it came to the recently released National Academies of Science “Consensus Study” Reproducibility and Replicability in Science (2019), no one did.

This Consensus Study was prompted by concerns about the reproducibility and replicability of scientific research. …To carry out the task, the National Academies appointed a committee of 15 members representing a wide range of expertise. (NAS Consensus Study, xvi)

I. Use the correct definition of P-value, and distinguish likelihood from probability. I limit myself to their remarks on statistical significance tests.

“Because hypothesis testing has been involved in a major portion of reproducibility and replicability assessments, we consider this mode of statistical inference in some detail.” (p.34) Unfortunately, they don’t give us the essential details, and what they give us contains flaws. Let me annotate what they say:

(1) Scientists use the term null hypothesis to describe the supposition that there is no difference between the two intervention groups or no effect of a treatment on some measured outcome (Fisher, 1935). (2) A standard statistical test aims to answer the question: If the null hypothesis is true, what is the likelihood of having obtained the observed difference? (3) In general, the greater the observed difference, the smaller the likelihood it would have occurred by chance when the null hypothesis is true. (4) This measure of the likelihood that an obtained value occurred by chance is called the p-value. (NAS Consensus Study p. 34)

Remarks:

(1) This limits the null hypothesis H0 to the “nil” or point null–an artificial restriction at the heart of many problems.

(2) It would be wrong to say the “aim” of a standard statistical test is getting a P-value–even if they did correctly define P-value, which they don’t. In fact, they define it incorrectly everywhere in the book, which is baffling. The aim, or an aim, is to distinguish signal from noise, or genuine effects from random error, or the like–in relation to a reference hypothesis (test hypothesis). The P-value is the probability (not the likelihood) of a difference as large as or larger than the observed d0, computed under the assumption that the null hypothesis H0 is true. Any observed result d0 will be improbable in some respect. So if you declared evidence of a genuine effect whenever the observed difference was improbable under H0, you’d have an extremely high Type I error probability (if not 1).

By looking at the P-value, Pr(d ≥ d0; H0), we reason as follows: if differences even larger than d0 occur fairly frequently under H0 (the P-value is not small), there’s scarcely evidence of incompatibility with H0. Small P-values indicate a genuine discrepancy from (or incompatibility with) H0, but isolated small P-values don’t suffice as evidence of genuine experimental effects (as Fisher stresses). (See this post.) [i]
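To fix ideas, here is a minimal sketch, with entirely made-up numbers, of what Pr(d ≥ d0; H0) comes to for a simple two-sample comparison of means (a one-sided z-test with assumed standard deviations is used purely for illustration):

```python
# Minimal sketch (not from the NAS report): computing Pr(d >= d0; H0) for a
# two-sample comparison of means via an illustrative one-sided z-test.
# All numbers below are made up for illustration.
import math
from scipy import stats

n1, n2 = 50, 50                  # group sizes (hypothetical)
mean1, mean2 = 10.8, 10.0        # observed group means (hypothetical)
sd1, sd2 = 2.0, 2.0              # assumed standard deviations

d_obs = mean1 - mean2                          # observed difference d0
se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)      # standard error of the difference
z = d_obs / se                                 # standardized difference

# One-sided P-value: probability of a difference at least as large as d0,
# computed under the assumption that H0 (no difference) is true.
p_value = stats.norm.sf(z)
print(f"observed difference = {d_obs:.2f}, z = {z:.2f}, P-value = {p_value:.4f}")
```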

(3) This is OK, but “likelihood” is a technical term and should not be used as a synonym for “probability” in any discussion trying to clarify terms. Doing so just begs for confusion and transposition fallacies. For example, frequentists will assign likelihoods, but not probabilities, to statistical hypotheses.
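A quick way to see the difference: probabilities attach to outcomes under a fixed hypothesis and sum to 1 over those outcomes, whereas likelihoods attach to hypotheses given fixed data and need not sum to anything in particular. A minimal sketch, using a made-up binomial example:

```python
# Minimal sketch (illustrative, not from the NAS report): probability vs. likelihood.
# Probability: fix the hypothesis, vary the data -- values sum to 1 over outcomes.
# Likelihood: fix the data, vary the hypothesis -- values need not sum to 1.
from scipy import stats

n, x_obs = 10, 7   # hypothetical binomial data: 7 successes in 10 trials

# Probabilities of all outcomes under a single hypothesis theta = 0.5: they sum to 1.
probs = [stats.binom.pmf(x, n, 0.5) for x in range(n + 1)]
print(f"sum over outcomes (theta fixed at 0.5): {sum(probs):.3f}")   # 1.000

# Likelihoods of several hypotheses given the fixed observed data: not a
# probability distribution over hypotheses (the sum depends entirely on which
# hypotheses you choose to include and how finely you grid them).
thetas = [0.1 * k for k in range(1, 10)]
likes = [stats.binom.pmf(x_obs, n, th) for th in thetas]
print(f"sum over hypotheses (data fixed at x=7): {sum(likes):.3f}")  # not 1 in general
```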

In case you thought (2) was a slip, the error is repeated in (4):

(4) This measure of the likelihood that an obtained value occurred by chance is called the p-value.

NO. This is wrong. So I return to my question: wouldn’t this be the first thing you would check if you were serving on this committee?

II. Consensus? Again, P-values and likelihood. After that wobbly introduction to statistical tests, this Consensus Document turns to remarks from the 2019 American Statistical Association (ASA) editorial by Wasserstein, R., Schirm, A. and Lazar, N. (2019) (ASA II). Unlike the 2016 Statement on P-values (ASA I), its authors are clear that ASA II is not a consensus document, but rather is “open to debate”. The NAS Consensus Study does not note this qualification, although it does not go as far as ASA II in declaring that the concept of statistical significance be banished.

More recently, it has been argued that p-values, properly calculated and understood, can be informative and useful; however, a conclusion of statistical significance based on an arbitrary threshold of likelihood (even a familiar one such as p ≤ 0.05) is unhelpful and frequently misleading (Wasserstein et al., 2019) (NAS Consensus Study)  [ii]

Now any prespecified “threshold” for statistical significance is “arbitrary”, according to ASA II, so it’s not clear how the two parts of this sentence cohere. Let’s agree that the attained P-value should always be reported. It doesn’t follow that taking into account whether it satisfies a preset value, say 0.005, is misleading. Moreover, thresholds can be intelligently chosen, e.g., to reflect meaningful population effect sizes. (See my recent “P-value thresholds: Forfeit at your peril“)
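As a toy illustration of what a non-arbitrary choice might look like, here is a sketch with assumed numbers (a one-sided two-sample z-test, a threshold of 0.005, and a standardized effect of 0.3 taken to be the smallest of interest; none of these values come from the NAS or ASA documents) showing how a threshold and sample size can be pegged to a meaningful population effect size:

```python
# Minimal sketch (illustrative assumptions throughout): choosing a significance
# threshold and sample size with a meaningful population effect size in view,
# rather than treating the cutoff as arbitrary.
import math
from scipy import stats

alpha = 0.005          # prespecified threshold (assumed for illustration)
effect = 0.3           # smallest standardized effect deemed meaningful (assumed)
n_per_group = 400      # candidate sample size per group

se = math.sqrt(2.0 / n_per_group)     # SE of the difference in standardized units
z_alpha = stats.norm.isf(alpha)       # critical value for the chosen threshold
power = stats.norm.sf(z_alpha - effect / se)   # one-sided power at the assumed effect
print(f"power to detect effect {effect} at alpha = {alpha}: {power:.2f}")   # ~0.95
```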

The NAS Consensus Study continues with the following, which again I’ll annotate:

(5) In some cases, it may be useful to define separate interpretive zones, where p-values above one significance threshold are not deemed significant, p-values below a more stringent significance threshold are deemed significant, and p-values between the two thresholds are deemed inconclusive. (6) Alternatively, one could simply accept the calculated p-value for what it is—the likelihood of obtaining the observed result if the null hypothesis were true—and refrain from further interpreting the results as “significant” or “not significant.” (NAS Consensus Study, 36)

Remarks:

(5) This first part is a good idea, and it is in sync with how Neyman and Pearson (N-P) first set out tests, with three regions (a minimal sketch of such three-zone reporting follows the quote below). ASA II, however, is opposed to trichotomy:

[T]he problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups. (ASA II)
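For concreteness, here is a minimal sketch of the three-zone reporting described in (5); the two cutoffs used (0.005 and 0.05) are purely illustrative choices, not values given by the NAS or ASA:

```python
# Minimal sketch of a three-zone reading of a computed P-value.
# The cutoffs 0.005 and 0.05 are illustrative assumptions only.
def three_zone_report(p_value: float,
                      strict: float = 0.005,
                      lenient: float = 0.05) -> str:
    """Classify a P-value as significant / inconclusive / not significant."""
    if p_value <= strict:
        return "significant (below the more stringent threshold)"
    if p_value <= lenient:
        return "inconclusive (between the two thresholds)"
    return "not significant (above the lenient threshold)"

for p in (0.001, 0.02, 0.20):
    print(f"p = {p}: {three_zone_report(p)}")
```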

(6) No. There they go again. The P-value is not the probability of obtaining the observed result if the null hypothesis were true; it is the probability of a result at least as large as the one observed, computed under the supposition that the null hypothesis is true. And please stop saying “likelihood” when you mean “probability”. They are not the same. It might be fine in informal discussions, but not in a guide intended to ward off fallacies.

The Consensus Study considers different ways to ascertain successful replication.

 CONCLUSION 5-2: A number of parametric and nonparametric methods may be suitable for assessing replication across studies. However, a restrictive and unreliable approach would accept replication only when the results in both studies have attained “statistical significance,” that is, when the p-values in both studies have exceeded a selected threshold. Rather, in determining replication, it is important to consider the distributions of observations and to examine how similar these distributions are. (NAS Consensus Study, p. 74)

They do not show that accepting replication so long as “distributions of observations” are deemed “similar” in some sense (left vague) is more reliable than requiring that results attain prespecified P-value thresholds–at least if the assessment of unreliability includes increases in false positives as well as false negatives [iii]. Of course, testing thresholds need to be intelligently chosen, with regard for variability, the indicated magnitude of discrepancy (in the initial study), and the power of the tests to detect various discrepancies.

III. Some Good Points: Data-dependent subgroups and double counting. There are plenty of important points throughout the Consensus Study; I mention just two. First, they tell the famous story of Richard Peto and post-data subgroups.

Misuse of statistical testing often involves post hoc analyses of data already collected, making it seem as though statistically significant results provide evidence against the null hypothesis, when in fact they may have a high probability of being false positives…. A study from the late-1980s gives a striking example of how such post hoc analysis can be misleading. The International Study of Infarct Survival was a large-scale, international, randomized trial that examined the potential benefit of aspirin for patients who had had a heart attack. After data collection and analysis were complete, the publishing journal asked the researchers to do additional analysis to see if certain subgroups of patients benefited more or less from aspirin. Richard Peto, one of the researchers, refused to do so because of the risk of finding invalid but seemingly significant associations. In the end, Peto relented and performed the analysis, but with a twist: he also included a post hoc analysis that divided the patients into the twelve astrological signs, and found that Geminis and Libras did not benefit from aspirin, while Capricorns benefited the most (Peto, 2011). This obviously spurious relationship illustrates the dangers of analyzing data with hypotheses and subgroups that were not prespecified (p.97)
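To see how easily such post hoc subgroup hunting manufactures “significant” noise, here is a minimal simulation sketch with made-up numbers (12 subgroups, no true effect anywhere); it is not Peto’s data, just an illustration of the mechanism:

```python
# Minimal sketch (hypothetical numbers, not Peto's data): with no true effect at all,
# hunting through 12 post hoc subgroups still turns up nominally "significant"
# findings far more often than the 5% one might naively expect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_subgroups, n_per_arm = 2000, 12, 100
alpha = 0.05
any_hit = 0

for _ in range(n_trials):
    found = False
    for _ in range(n_subgroups):
        # Treatment and control outcomes drawn from the SAME distribution: H0 true.
        treat = rng.normal(0.0, 1.0, n_per_arm)
        ctrl = rng.normal(0.0, 1.0, n_per_arm)
        _, p = stats.ttest_ind(treat, ctrl)
        if p <= alpha:
            found = True
            break
    any_hit += found

print(f"proportion of trials with at least one 'significant' subgroup: {any_hit / n_trials:.2f}")
# roughly 1 - 0.95**12 ≈ 0.46, rather than 0.05
```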

Then there’s a note prohibiting “double counting” data:

A fundamental principle of hypothesis testing is that the same data that were used to generate a hypothesis cannot be used to test that hypothesis (de Groot, 2014). In confirmatory research, the details of how a statistical hypothesis test will be conducted must be decided before looking at the data on which it is to be tested. When this principle is violated, significance testing, confidence intervals, and error control are compromised. Thus, it cannot be assured that false positives are controlled at a fixed rate. In short, when exploratory research is interpreted as if it were confirmatory research, there can be no legitimate statistically significant result. (NAS Consensus Study)

Strictly speaking, there are cases where error control can be retained despite apparently violating this principle (often called the requirement of “use novelty” in philosophy). It will depend on what’s meant by “same data”. For example, data can be used in generating a hypothesis about how a statistical assumption can fail, and also be used in testing that assumption; however, “the data” are remodeled to ask a different question, and error control can be retained. (See SIST, Excursion 4, Tours II and III.) [iv]
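By contrast, here is a minimal simulation sketch (illustrative numbers only, not an example from SIST or the NAS report) of the straightforward violation the NAS has in mind: generating a hypothesis by scanning the data for the most extreme of several outcomes, and then testing that hypothesis on the very same data, inflates the actual false-positive rate well beyond the nominal 5%:

```python
# Minimal sketch (illustrative): the hypothesis "outcome k differs from zero" is
# chosen by scanning the same data for the most extreme of several outcomes,
# then tested on those data. The nominal 5% error rate is no longer controlled.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_outcomes, n_obs = 5000, 10, 50
false_positives = 0

for _ in range(n_sims):
    data = rng.normal(0.0, 1.0, size=(n_outcomes, n_obs))   # all null: no real effects
    means = data.mean(axis=1)
    k = int(np.argmax(np.abs(means)))            # hypothesis generated from the data
    _, p = stats.ttest_1samp(data[k], 0.0)       # ...and tested on the same data
    false_positives += (p <= 0.05)

print(f"actual false-positive rate: {false_positives / n_sims:.2f}  (nominal: 0.05)")
# Expected to be roughly 1 - 0.95**10 ≈ 0.40.
```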

IV. Issue errata for the P-value definitions. The NAS Consensus Study has only just come out; issuing a correction now will avoid a new generation of incorrect understandings of P-values.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

[i] For a severe Tester’s justification of P-values, see Souvenir C of SIST (Mayo 2018, CUP): A Severe Tester’s Translation Guide.

[ii] Perhaps the phrase “it has been argued that” indicates that it is just one of many views, but elsewhere in the document points from ASA II are reported without qualification. Fortunately, the document does not include the ASA II recommendation not to use the words “significant/significance”. Later in the book, they give items from the 2016 ASA Statement on P-Values and Statistical Significance (ASA I), which is largely a consensus document.

[iii] They do regard the point as “reinforced by [ASA II] in which the use of a statistical significance threshold in reporting is strongly discouraged due to overuse and wide misinterpretation” (Wasserstein et al., 2019).

[iv] Nor is “double counting” necessarily pejorative when testing explanations of a known effect. I delineate the cases according to whether a severity assessment of the inference of interest is invalidated.

Mayo, D. 2018. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (SIST). CUP.

You can find many excerpts and mementos from SIST on this blog, collected in this post.