When science writers, especially “statistical war correspondents”, contact you to weigh in on some article, they may talk to you until they get something spicy, and then they may or may not include the background context. So a few writers contacted me this past week regarding this article (“Retire Statistical Significance”)–a teaser, I now suppose, to advertise the ASA collection growing out of that conference “a world beyond P ≤ .05” way back in Oct 2017, where I gave a paper*. I jotted down some points, since Richard Harris from NPR needed them immediately, and I had just gotten off a plane when he emailed. He let me follow up with him, which is rare and greatly appreciated. So I streamlined the first set of points, and dropped any points he deemed technical. I sketched the third set for a couple of other journals who contacted me, who may or may not use them. Here’s Harris’ article, which includes a couple of my remarks.
1. We agree that it is an age-old fallacy to take non-rejection of a null hypothesis as evidence for the null: a non-statistically significant result is not evidence for the null, because a test may have a low probability of rejecting a null even if it is false (i.e., it might have low power to detect a particular alternative).
The solution in the severity interpretation of tests is to take a result that is not statistically significant at a small level, i.e., a large P-value, as ruling out given discrepancies from the null or other reference value:
The data indicate that discrepancies from the null are less than those parametric values the test had a high probability of detecting, if present. See p. 351 of Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). [i]
This is akin to the use of power analysis, except that it is sensitive to the actual outcome. It is very odd that this paper makes no mention of power analysis, since that is the standard way to interpret non-significant results.
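The contrast can be sketched numerically. Here is a minimal illustration, with all numbers hypothetical, using a one-sided Normal test with known σ: power is fixed by the alpha cut-off, while severity uses the observed (non-significant) outcome to warrant an upper bound on the discrepancy.

```python
from statistics import NormalDist

# One-sided Normal test (known sigma): H0: mu <= 0 vs H1: mu > 0.
# All numbers are hypothetical, chosen to contrast power (fixed at the
# alpha cut-off) with severity (sensitive to the actual outcome).
Phi = NormalDist().cdf
n, sigma, mu0 = 100, 10.0, 0.0
se = sigma / n ** 0.5          # standard error of the mean = 1.0
xbar = 1.5                     # observed mean: z = 1.5, one-sided p ~ .067

def power(mu1, alpha=0.05):
    """P(test rejects at level alpha; mu = mu1): ignores the actual outcome."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return 1 - Phi(z_alpha - (mu1 - mu0) / se)

def severity_mu_less_than(mu1):
    """SEV(mu < mu1) = P(X-bar exceeds the observed xbar; mu = mu1)."""
    return Phi((mu1 - xbar) / se)

# The test had high power to detect mu = 3, yet did not reject:
print(round(power(3.0), 3))                    # ~ .912
# So the non-significant result warrants ruling out discrepancies that large:
print(round(severity_mu_less_than(3.0), 3))    # ~ .933
# But not discrepancies as small as 2:
print(round(severity_mu_less_than(2.0), 3))    # ~ .691
```

The non-significant outcome licenses "μ < 3" severely precisely because a discrepancy of 3 would very probably have produced rejection, while the same cannot be said for smaller discrepancies.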
Using non-significant results (“moderate” P-values) to set upper bounds is done throughout the sciences and is highly informative. This paper instead urges us to read into any observed difference found to be in the welcome direction, to potentially argue for an effect.
2. I agree that one shouldn't mechanically use P < .05. Ironically, they endorse a single .95 confidence interval (CI). They should actually use several levels, as is done with a severity assessment.
I have objections to their interpretation of CIs, but I will mainly focus my objections on the proposed ban of the words "significance" and "significant". It's not too hard to report that results are significant at level .001, or whatever level is attained. Assuming researchers invariably use an unthinking cut-off, rather than reporting the significance level attained by the data, they want to ban the words. They claim this is a political fight, and so that arguing by an appeal to numbers is appropriate for science. I think many will take this as yet one more round of significance test bashing–even though, amazingly, it is opposite to the most popular of today's statistical wars. I explain in #3. (The actual logic of significance testing is lost in both types of criticism.)
3. The most noteworthy feature of this criticism of statistical significance tests is that it is opposite to the most well-known and widely circulated current criticisms of significance tests.
In other words, the big move in the statistics wars these days is to fight irreplication by making it harder to reject, and find evidence against, a null hypothesis. The most well known Bayesian reforms being bandied about do this by giving a point prior–a lump of prior probability–to a point null hypothesis. (There’s no mention of this in the paper.)
These Bayesians argue that small P-values are consistent with strong evidence for the null hypothesis. They conclude that P-values exaggerate the evidence against the null hypothesis. Never mind for now that they are insisting P-values be measured against a standard that is radically different from what the P-value means. All of the criticisms invoke reasoning at odds with statistical significance tests. I want to point out the inconsistency between those reforms and the current one. I will call them Group A and Group B:
Group A: “Make it harder to find evidence against the null”: a P-value of .05 (i.e. a statistically significant result) should not be taken as evidence against the null, it may often be evidence for the null.
Group B (“Retire Stat Sig”): “Make it easier to find evidence against the null”: a P-value > .05 (i.e., a non-statistically significant result) should not be taken as evidence for the null, it may often be evidence against the null.
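The Group A phenomenon has a numerical basis worth sketching. Assuming (illustratively) a Normal model in which H0: μ = 0 receives a lump of prior probability and μ ~ N(0, τ²) under the alternative, the Bayes factor for a just-significant z = 1.96 outcome comes to favor the point null as n grows (the Jeffreys–Lindley effect); all numbers below are hypothetical.

```python
import math

# Sketch of the Group A phenomenon (Jeffreys-Lindley effect); all numbers
# illustrative. Model: X-bar ~ N(mu, sigma^2/n); H0: mu = 0 gets a point
# ("lump") prior; under H1, mu ~ N(0, tau^2). Then
#   BF01 = sqrt(1 + r) * exp(-(z^2/2) * r/(1 + r)),  r = n*tau^2/sigma^2.

def bf01(z, n, sigma=1.0, tau=1.0):
    """Bayes factor in favor of the point null, given the observed z-score."""
    r = n * tau**2 / sigma**2
    return math.sqrt(1 + r) * math.exp(-(z**2 / 2) * r / (1 + r))

z = 1.96  # a just-significant result (two-sided p ~ .05)
print(round(bf01(z, n=10), 2))      # < 1: mild evidence against the null
print(round(bf01(z, n=1000), 2))    # > 1: the same P-value now favors the null
```

The same borderline P-value that Group B would read as potential evidence against the null is converted, on the lump-prior analysis, into evidence for it once n is large: the two reform camps pull the verdict in opposite directions.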
A proper use and interpretation of statistical tests (as set out in my SIST) interprets P-values correctly in both cases and avoids fallacies of rejection (inferring a magnitude of discrepancy larger than warranted) and fallacies of non-rejection (inferring the absence of discrepancies the test had little capability of detecting).
4. The fact that we shouldn't use thresholds unthinkingly does not mean we don't need thresholds for lousy and terrible evidence! When data provide lousy evidence, when little if anything has been done to rule out known flaws in a claim, it's not a little bit of evidence (on my account). The most serious concern with the "Retire" argument to ban thresholds for significance is that it is likely to encourage the practice whereby researchers spin their non-significant results by P-hacking or data dredging. It's bad enough that they do this already. Read Goldacre. [ii]
Note that they say the researcher should discuss the observed difference. This opens the door to spinning it convincingly to the uninitiated reader.
5. What about selection effects? The really important question that is not mentioned in this paper is whether the researcher is allowed to search for endpoints post-data.
My own account replaces P-values with reports of how severely tested various claims are, whether formal or informal. If we are in a context reporting P-values, the phrase “statistically significant” at the observed P-value is important because the significance level is invalidated by multiple testing, optional stopping, data-dependent subgroups, and data dredging. Everyone knows that. (A P-value, by contrast, if detached from corresponding & testable claims about significance levels, is sometimes seen as a mere relationship between data and a hypothesis.) Getting rid of the term is just what is wanted by those who think the researcher should be free to scour the data in search of impressive-looking effects, or interpret data according to what they believe. Some aver that their very good judgment allows them to determine post-data what the pre-registered endpoints really are or were or should have been. (Goldacre calls this “trust the trialist”). The paper mentions pre-registration fleetingly, but these days we see nods to it that actually go hand in hand with flouting it.
The ASA P-value Guide very pointedly emphasizes that selection effects invalidate P-values. But it does not say that selection effects need to be taken into account by any of the alternative measures of evidence, including Bayesian and Likelihoodist. Are they free from Principle 4 on transparency, or not? Whether or when to take account of multiple testing and data dredging are known to be key points on which those accounts differ from significance tests (at least all those who hold to the Likelihood Principle, as with Bayes Factors and Likelihood Ratios).
6. A few asides:
They should really be doing one-sided tests, doing away with the point null altogether. (Then the test hypothesis and alternative hypothesis are symmetrical, as with N-P tests.)
The authors seem to view a test as a report on parameter values that merely fit or are compatible with the data. This misses testing reasoning! Granted, the points within a CI aren't far enough away to reject the null at level .05, but that doesn't mean there's evidence for them. In other words, they commit the same fallacy they are on about, but regarding members of the CI. In fact, there is fairly good evidence that the parameter value is less than those values close to the upper confidence limit. Yet this paper calls them compatible, even where there's strong evidence against them.
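A small sketch makes the point concrete (hypothetical Normal setup with known σ; all numbers illustrative): a value inside the .95 CI, and hence "compatible" on their reading, can still be one the data supply fairly strong evidence against.

```python
from statistics import NormalDist

# Hypothetical Normal example (known sigma): observed mean with SE = 1.
# A value can sit inside the .95 CI ("compatible" on the paper's reading)
# while the data give fairly strong evidence that mu lies *below* it.
Phi = NormalDist().cdf
xbar, se = 1.5, 1.0
upper = xbar + 1.96 * se       # upper .95 confidence limit = 3.46

mu1 = 3.2                      # inside the CI, close to the upper limit
assert mu1 < upper             # so it counts as "compatible" with the data

# Severity for the claim mu < mu1: P(X-bar > observed xbar; mu = mu1)
sev = Phi((mu1 - xbar) / se)
print(round(sev, 3))           # ~ .955: good evidence that mu < 3.2
```

So "compatible" lumps together values well indicated by the data and values against which there is already fairly strong evidence, which is exactly the testing reasoning the CI-only report loses.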
[Using one-sided tests and letting the null assert: a positive effect exists, the recommended account is tantamount to taking the non-significant result as evidence for this null.]
Second Set (to briefly give the minimal non-technical points):
I do think we should avoid the fallacy of going from a large P-value to evidence for a point null hypothesis: inferring evidence of no effect.
CIs at the .95 level are more dichotomous than reporting attained P-values for various hypotheses.
The fact that we shouldn’t use thresholds unthinkingly does not mean we don’t need thresholds for lousy and terrible evidence!
The most serious concern with the argument to ban thresholds for significance is that it encourages researchers to spin their non-significant results by P-hacking, data dredging, multiple testing, and outcome-switching.
I would like to see some attention paid to how easy it is to misinterpret results with Bayesian and Likelihoodist methods. Obeying the LP, there is no onus to take account of selection effects, and priors are very often data-dependent, giving even more flexibility.
Third Set (for different journals)
Banning the word “significance” may well free researchers from being held accountable when they downplay negative results and search the data for impressive-looking subgroups.
I would like to see some attention paid to how easy it is to misinterpret results on Bayesian and Likelihoodist methods. The brouhaha is all about a method that plays a small role in an overarching methodology that is able to bound the probabilities of seriously misleading interpretations of data. These are called error probabilities. Their role is just a first indication of whether results could readily be produced by chance variability alone.
Rival schools of statistics (the ASA Guide’s “alternative accounts of evidence”) have never shown their worth in controlling error probabilities of methods. (Without this, we cannot assess their capability for having probed mistaken interpretations of data).
Until those alternative methods are subject to scrutiny for the same or worse abuses (biasing selection effects), we should be wary of ousting significance tests.
One needs to consider a statistical methodology as a whole–not one very small piece. That full methodology may be called error statistics. (Pulling the simple significance test with a point null & no alternative or power consideration out of context, as in the ASA Guide, hardly does justice to the overall methodology of which these tests are just a piece. Error statistics is known to be a piecemeal account–it’s highly distorting to focus on an artificial piece of it.)
Those who use these methods with integrity never recommend using a single test to move from statistical significance to a substantive scientific claim. Once a significant effect is found, they move on to estimating its effect size & exploring properties of the phenomenon. I don’t favor existing testing methodologies but rather reinterpret tests as a way to infer discrepancies that are well or poorly indicated. I described this account over 25 years ago.
On the other hand, simple significance tests are important for testing assumptions of statistical models. Bayesians, if they test their assumptions, use them as well, so they could hardly ban them entirely. But what are P-values measuring? OOPS! You're not allowed to utter the term s____ance level that was coined for the very thing your P-values are measuring. Big Brother has dictated! (Look at how strange it is to rewrite Goldacre's claim below without it. [ii])
Moreover, the lead editorial in the new “world after P ≤ 0.05” collection warns us that even if scientists repeatedly show statistically significant increases (p< 0.01 or 0.001) in lead poisoning among children in Flint, we mustn’t “conclude anything about scientific or practical importance” such as the water is causing lead poisoning.
“Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof)” (p.1, editorial for the Special Issue).
Following this rule (and note that the qualification that had been in the ASA Guide is missing) would mean never inferring risks of concern when there was uncertainty, among much else that would go by the wayside. Risks would have to be so large and pervasive that no statistics is needed! Statistics becomes just window dressing, with no actual upshot about the world. Menopausal women would still routinely be taking, and dying from, hormone replacement therapy, because "real world" observational results are compatible with HRT staving off age-related diseases.
Welcome to the brave new world after abandoning error control.
See also my post “Deconstructing ‘A World Beyond P-values’”
[i] Mayo, D. (2018). Statistical Inference as Severe Testing: How To Get Beyond the Statistics Wars, Cambridge: Cambridge University Press.
[ii] Should we replace the offending terms with “moderate or non-small P-values”? The required level for “significance” is separately reported.
Misleading reporting by presenting a study in a more positive way than the actual results reflect constitutes ‘spin’. Authors of an analysis of 72 trials with non-significant results reported it was a common phenomenon, with 40% of the trials containing some form of spin. Strategies included reporting on statistically significant results for within-group comparisons, secondary outcomes, or subgroup analyses and not the primary outcome, or focussing the reader on another study objective away from the statistically non-significant result. (Goldacre)