“The 2019 ASA Guide to P-values and Statistical Significance: Don’t Say What You Don’t Mean” (Some Recommendations)

Some have asked me why I haven’t blogged on the recent follow-up to the ASA Statement on P-Values and Statistical Significance (Wasserstein and Lazar 2016)–hereafter, ASA I. They’re referring to the editorial by Wasserstein, R., Schirm, A. and Lazar, N. (2019) –hereafter, ASA II–opening a special on-line issue of over 40 contributions responding to the call to describe “a world beyond P < 0.05”.[1] Am I falling down on the job? Not really. All of the issues are thoroughly visited in my Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, SIST (2018, CUP). I invite interested readers to join me on the statistical cruise therein[2]. As the ASA II authors observe: “At times in this editorial and the papers you’ll hear deep dissonance, the echoes of ‘statistics wars’ still simmering today (Mayo 2018)”. True, and reluctance to reopen old wounds has only allowed them to fester. However, I will admit that when new attempts at reform are put forward, a philosopher of science who has written on the statistics wars ought to weigh in on the specific prescriptions/proscriptions, especially when a jumble of fuzzy conceptual issues is interwoven through a cacophony of competing reforms. (My published comment on ASA I, “Don’t Throw Out the Error Control Baby With the Bad Statistics Bathwater” is here.)

So I should say something. But the task is delicate. And painful. Very. I should start by asking: What is it (i.e., what is it actually saying)? Then I can offer some constructive suggestions.

The Invitation to Broader Consideration and Debate

The papers in this issue propose many new ideas, ideas that in our determination as editors merited publication to enable broader consideration and debate. The ideas in this editorial are likewise open to debate. (ASAII p. 1)

The questions around reform need consideration and debate. (p. 9)

Excellent! A broad, open, critical debate is sorely needed. Still, we can only debate something when there is a degree of clarity as to what “it” is. I will be very happy to post readers’ meanderings on ASA II (~1000 words) if you send them to me.

My focus here is just on the intended positions of the ASA, not the summaries of articles. This comprises around the first 10 pages. Even from just the first few pages the reader is met with some noteworthy declarations:

♦ Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof) (p. 1)

♦ No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p. 2)

♦ Whether a p-value passes any arbitrary threshold should not be considered at all when deciding which results to present or highlight. (p. 2, my emphasis)

♦ A declaration of statistical significance is the antithesis of thoughtfulness. (p. 4)

♦ It is time to stop using the term “statistically significant” entirely. Nor should variants such as “significantly different,” “p < 0.05,” and “nonsignificant” survive. (p. 2)

♦ “Statistically significant”– don’t say it and don’t use it (p. 2)

(Wow!)

I am very sympathetic with the concerns about rigid cut-offs, and fallacies of moving from statistical significance to substantive scientific claims. I feel as if I’ve just written a whole book on it! I say, on p. 10 of SIST:

In formal statistical testing, the crude dichotomy of “pass/fail” or “significant or not” will scarcely do. We must determine the magnitudes (and directions) of any statistical discrepancies warranted, and the limits to any substantive claims you may be entitled to infer from the statistical ones.

Since ASA II will still use P-values, you’re bound to wonder why a user wouldn’t just report “the difference is statistically significant at the P-value attained”. (The probability of observing even larger differences, under the assumption of chance variability alone, is p.) Confidence intervals (CIs) are already routinely given alongside P-values. So there is clearly more to the current movement than meets the eye. But for now I’m just trying to decipher what the ASA position is.

What’s the Relationship Between ASA I and ASA II?

I assume, for this post, that ASA II is intended to be an extension of ASA I. In that case, it would subsume the 6 principles of ASA I. There is evidence for this. For one thing, it begins by sketching a “sampling” of “don’ts” from ASA I, for those who are new to the debate. Secondly, it recommends that ASA I be widely disseminated. But some Principles (1, 4) are apparently missing[3], and others are rephrased in ways that alter the initial meanings. Do they really mean these declarations as written? Let us take them at their word.

But right away we are struck with a conflict with Principle 1 of ASA I–which happens to be the only positive principle given.

1. P-values can indicate how incompatible the data are with a specified statistical model.

A p-value provides one approach to summarizing the incompatibility between a particular set of data and a proposed model for the data. The most common context is a model, constructed under a set of assumptions, together with a so-called “null hypothesis.” Often the null hypothesis postulates the absence of an effect, such as no difference between two groups, or the absence of a relationship between a factor and an outcome. The smaller the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold. This incompatibility can be interpreted as casting doubt on or providing evidence against the null hypothesis or the underlying assumptions. (ASA I p. 131)

However, an indication of how incompatible data are with a claim of the absence of a relationship between a factor and an outcome would be an indication of the presence of the relationship; and providing evidence against a claim of no difference between two groups would often be of scientific or practical importance.

So, Principle 1 (from ASA I) doesn’t appear to square with the first bulleted item I listed (from ASA II):

(1) “Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)” (p.1, ASA II).

Either modify (1) or erase Principle 1. But if you erase all thresholds for finding incompatibility (whether using P-values or other measures), there are no tests, and no falsifications, even of the statistical kind.

My understanding (from Ron Wasserstein) is that this bullet is intended to correspond to Principle 5 in ASA I – that P-values do not give population effect sizes. But it is now saying something stronger (at least to my ears and to everyone else I’ve asked). Do the authors mean to be saying that nothing (of scientific or practical importance) can be learned from statistical significance tests? I think not.

So, my first recommendation is:

Replace (1) with:

“Don’t conclude anything about the scientific or practical importance of the (population) effect size based only on statistical significance (or lack thereof).”

Either that, or simply stick to Principle 5 from ASA I: “A p-value, or statistical significance[4], does not measure the size of an effect or the importance of a result.” (p. 132) This statement is, strictly speaking, a tautology, true by the definitions of terms: probability isn’t itself a measure of the size of a (population) effect. However, you can use statistically significant differences to infer what the data indicate about the size of the (population) effect.[5]

My second friendly amendment concerns the second bulleted item:

(2) No p-value can reveal the plausibility, presence, truth, or importance of an association or effect. (p. 2)

Focus just on “presence”. From this assertion it would seem to follow that no P-values[6], however small, even from well-controlled trials, can reveal the presence of an association or effect–and that is too strong. But I’m guessing, for now, the authors do not mean to say this. If you don’t mean it, don’t say it.

So, my second recommendation is to replace (2) with:

“No p-value by itself can reveal the plausibility, presence, truth, or importance of an association or effect.”

Without this friendly amendment, ASA II is at loggerheads with ASA I, and they should not be advocating those 6 principles without changing either or both. Without this or a similar modification, moreover, any other statistical quantity or evidential measure is likewise unable to reveal these things. Or so many would argue. These modest revisions might prevent some readers from stopping after the first few pages, and that would be a shame, as they would miss the many right-headed insights about linking statistical and scientific inference.

This leads to my third bulleted item from ASA II:

(3) A declaration of statistical significance is the antithesis of thoughtfulness… it ignores what previous studies have contributed to our knowledge. (p. 4)

Surely the authors do not mean to say that anyone who asserts the observed difference is statistically significant at level p has her hands tied and invariably ignores all previous studies, background information and theories in planning and reaching conclusions, decisions, proposed solutions to problems. I’m totally on board with the importance of backgrounds, and multiple steps relating data to scientific claims and problems:

The error statistician begins with a substantive problem or question. She jumps in and out of piecemeal statistical tests both formal and quasi-formal. The pieces are integrated in building up arguments from coincidence, informing background theory, self-correcting via blatant deceptions, in an iterative movement. The inference is qualified by using error probabilities to determine not “how probable,” but rather, “how well-probed” claims are, and what has been poorly probed. (SIST, p. 162)

But good inquiry is piecemeal: There is no reason to suppose one does everything at once in inquiry, and it seems clear from the ASA II guide that the authors agree. Since I don’t think they literally mean (3), why say it?

Practitioners who use these methods in medicine and elsewhere have detailed protocols for how background knowledge is employed in designing, running, and interpreting tests. When medical researchers specify primary outcomes, for just one example, it’s very explicitly with due regard for the mechanism of drug action. It’s intended as the most direct way to pick up on the drug’s mechanism. Finding incompatibility using P-values inherits the meaning already attached to a sensible test hypothesis. That valid P-values require context is even in the very important Principle 4 of ASA I (see note [3]).

As lawyer Nathan Schachtman observes, in a recent conversation on ASA II:

By the time a phase III clinical trial is being reviewed for approval, there is a mountain of data on pharmacology, pharmacokinetics, mechanism, target organ, etc. If Wasserstein wants to suggest that there are some people who misuse or misinterpret p-values, fine. The principle of charity requires that we give a more sympathetic reading to the broad field of users of statistical significance testing. (Schachtman 2019)

Now it is possible the authors are saying a reported P-value can never be thoughtful because thoughtfulness requires that a statistical measure, at any stage of probing, incorporate everything we know (SIST dubs this “big picture” inference). Do we want that? Or maybe (3) is their way of saying a statistical measure must incorporate background beliefs in the manner of Bayesian degree-of-belief (?) priors. Many would beg to differ, including some leading Bayesians. Andrew Gelman (2012) has suggested that ‘Bayesians Want Everybody Else to be Non-Bayesian’:

Bayesian inference proceeds by taking the likelihoods from different data sources and then combining them with a prior (or, more generally, a hierarchical model). The likelihood is key. . .  No funny stuff, no posterior distributions, just the likelihood. . . I don’t want everybody coming to me with their posterior distribution – I’d just have to divide away their prior distributions before getting to my own analysis. (ibid., p. 54)

So, my third recommendation is to replace (3) with (something like):

“Failing to report anything beyond a declaration of statistical significance is the antithesis of thoughtfulness.”

There’s much else that bears critical analysis and debate in ASA II; I’ll come back to it. I hope to hear from the authors of ASA II about my very slight, constructive amendments (to avoid a conflict with Principle 1).

Meanwhile, I fear we will see court cases piling up denying that anyone can be found culpable for abusing p-values and significance tests, since the ASA declared that all p-values are arbitrary, and whether predesignated thresholds are honored or breached should not be considered at all. (This was already happening based on ASA I.)

Please share your thoughts and any errors in the comments. I will indicate later drafts of this post with (i), (ii), … Do send me other articles you find discussing this.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 

References:

Gelman, A. (2012). “Ethics and the Statistical Use of Prior Information”. http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics5.pdf

Mayo, D. (2016). “Don’t Throw out the Error Control Baby with the Bad Statistics Bathwater: A Commentary” on R. Wasserstein and N. Lazar: “The ASA’s Statement on P-values: Context, Process, and Purpose”, The American Statistician 70(2).

Mayo, D. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.

Schachtman, N. (2019). (private communication)

Wasserstein, R. and Lazar, N. (2016). “The ASA’s Statement on P-values: Context, Process and Purpose”, (and supplemental materials), The American Statistician 70(2), 129–33. (ASA I)

Wasserstein, R., Schirm, A. and Lazar, N. (2019) Editorial: “Moving to a World Beyond ‘p < 0.05’”, The American Statistician 73(S1): 1-19. (ASA II)

 

[1] I gave an invited paper at the conference (“A world Beyond…”) out of which the idea for this volume grew. I was in a session with a few other exiles to describe the contexts where statistical significance tests are of value. I was too involved in completing my book to write up my paper for this volume, and neither did the others in our small group. Links to my slides and Yoav Benjamini’s are below.

I did post notes to journalists on the Amrhein article here.

[2] Excerpts and mementos from SIST are here.

[3] Principle 4 ASA I asserts that “proper inference requires full reporting and transparency”:

P-values and related analyses should not be reported selectively. Conducting multiple analyses of the data and reporting only those with certain p-values (typically those passing a significance threshold) renders the reported p-values essentially uninterpretable. ….Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis. Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed. (pp. 131-2)

[5] Consider, for example, a two-sided (symmetric) 95% confidence interval estimate of a Normal mean: [a, b]. This information can also be given in terms of observed significance levels.

CI-lower is the (parameter) value that the data x are just statistically significantly greater than, at the 0.025 level.

CI-upper is the (parameter) value that the data x are just statistically significantly smaller than, at the 0.025 level.

There’s a clear duality between statistical significance tests and confidence intervals. (The CI contains those parameter values that would not be rejected at the corresponding significance level, were they the hypotheses under test.) CIs were developed by the same man who co-developed Neyman-Pearson (N-P) tests in the same years (~1930): Jerzy Neyman. There are other ways to get indicated effect sizes such as with (attained) power analysis and the P-value distribution over different values of the parameter. The goal of assessing how severely tested a claim is serves to direct this analysis (Mayo 2018). However, the mathematical computations are well-known (see Fraser’s article in the collection), and continue to be extended in work on Confidence Distributions. See this blog or SIST for references.
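
To make the duality concrete, here is a minimal sketch in Python. It assumes the simplest textbook case, a Normal mean with known standard deviation, and the sample values (xbar, sigma, n) are hypothetical numbers chosen only for illustration. The function names are mine, not from any library or from SIST.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution

def p_value_greater(xbar, mu0, sigma, n):
    """One-sided P-value for 'the data exceed mu0' (test of mu <= mu0 vs mu > mu0)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 1 - Z.cdf(z)

def p_value_smaller(xbar, mu0, sigma, n):
    """One-sided P-value for 'the data fall short of mu0' (test of mu >= mu0 vs mu < mu0)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return Z.cdf(z)

def ci_95(xbar, sigma, n):
    """Two-sided symmetric 95% confidence interval for the mean (sigma known)."""
    half_width = Z.inv_cdf(0.975) * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical data summary (an assumption for illustration only).
xbar, sigma, n = 0.4, 1.0, 100

lower, upper = ci_95(xbar, sigma, n)
p_at_lower = p_value_greater(xbar, lower, sigma, n)  # attained level against CI-lower
p_at_upper = p_value_smaller(xbar, upper, sigma, n)  # attained level against CI-upper

print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
print(f"P-value testing CI-lower: {p_at_lower:.4f}")
print(f"P-value testing CI-upper: {p_at_upper:.4f}")
```

Both printed P-values come out at 0.025 (up to rounding), which is just the statement above: each CI endpoint is the parameter value from which the data differ at exactly the 0.025 level.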

However, confidence intervals as currently used in reform movements inherit many of the weaknesses of N-P tests: they are dichotomous (inside/outside), adhere to a single confidence level, and are justified merely with a long-run performance (or coverage) rationale. By considering the P-values associated with different hypotheses (corresponding to parameter values in the interval), one can scotch all of these weaknesses.
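
A rough sketch of what “considering the P-values associated with different hypotheses” could look like, again assuming a Normal mean with known standard deviation and hypothetical numbers. It is an illustration under those assumptions, not the severity computation developed in SIST.

```python
from math import sqrt
from statistics import NormalDist

def p_value_greater(xbar, mu0, sigma, n):
    """One-sided P-value for testing mu <= mu0 against mu > mu0 (sigma known)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(z)

# Hypothetical data summary and grid of parameter values (assumptions).
xbar, sigma, n = 0.4, 1.0, 100
grid = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# One P-value per hypothesized mean: a graded report rather than inside/outside.
p_curve = {mu0: p_value_greater(xbar, mu0, sigma, n) for mu0 in grid}

for mu0, p in p_curve.items():
    print(f"mu0 = {mu0:.1f}   P-value = {p:.4f}")
```

The P-value rises steadily as the hypothesized mean approaches and passes the observed mean (it is 0.5 when the two coincide), so the report is not tied to a single confidence level or to a yes/no verdict.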

It is often claimed that anything tests can do CIs do better (sung to the tune of “Annie Get Your Gun”). Not so. (See SIST p. 356). It is odd and ironic that psychologists urging us to use CIs depict statistical tests as exclusively of the artificial “simple” Fisherian variety, with a “nil” null and no explicit alternative, given how Paul Meehl chastised this tendency donkey’s years ago, and given that Jacob Cohen advanced power analysis.

A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are. (Cohen 1990, p. 1309)

See SIST, p. 323. For links to all of excursion 5 on power, see this post.
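
Cohen’s point can be seen in a minimal sketch of ordinary (pre-data) power for a one-sided test of a Normal mean with known standard deviation. The numbers and the function name are hypothetical, and this is the standard textbook power calculation, not the attained power or severity analysis discussed in SIST.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided(mu1, mu0, sigma, n, alpha=0.025):
    """Probability of rejecting mu <= mu0 at level alpha when the true mean is mu1."""
    se = sigma / sqrt(n)
    # Smallest sample mean that is statistically significant at level alpha.
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # Chance that the sample mean exceeds the cutoff under mu1.
    return 1 - NormalDist(mu1, se).cdf(cutoff)

# Hypothetical test setup (assumptions for illustration).
mu0, sigma, n = 0.0, 1.0, 100
alternatives = (0.0, 0.1, 0.2, 0.3, 0.4)

powers = {mu1: power_one_sided(mu1, mu0, sigma, n) for mu1 in alternatives}

for mu1, pw in powers.items():
    print(f"true mean = {mu1:.1f}   power = {pw:.3f}")
```

The calculation cannot be run at all without naming the magnitudes mu1 one cares to detect, which is the sense in which power analysis forces attention to how big things are.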

Of course, the beauty of the simple Fisherian test shows itself when there is no explicit alternative, as when testing assumptions of models–models that all the alternative statistical methods on offer also employ. ASA I also limits itself to the simple Fisherian test: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power…” (p. 130)

[6] I assume they intend to make claims about valid P-values, not those that are discredited by failing “audits” due either to violated assumptions, or to multiple testing and other selection effects given in Principle 4, ASA I. The largely unexceptional six principles of ASA I (2016) are:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.