I came across this post on Gelman’s blog today:

**Exchange with Deborah Mayo on abandoning statistical significance**

It was drawn straight from blog comments and email correspondence back when the ASA, and significant others, were rising up against the concept of statistical significance. Here it is:

#### Exchange with Deborah Mayo on abandoning statistical significance

The philosopher wrote:

The big move in the statistics wars these days is to fight irreplication by making it harder to reject, and find evidence against, a null hypothesis.

Mayo is referring to, among other things, the proposal to “redefine statistical significance” as p less than 0.005. My colleagues and I do not actually like that idea, so I responded to Mayo as follows:

I don’t know what the big moves are, but my own perspective, and I think that of the three authors of the recent article being discussed, is that we should not be “rejecting” at all, that we should move beyond the idea that the purpose of statistics is to reject the null hypothesis of zero effect and zero systematic error.

I don’t want to ban speech, and I don’t think the authors of that article do, either. I’m on record that I’d like to see everything published, including Bem’s ESP ~~paper~~ data and various other silly research. My problem is with the idea that rejecting the null hypothesis tells us anything useful.

Mayo replied:

I just don’t see that you can really mean to say that nothing is learned from finding low p-values, especially if it’s not an isolated case but time and again. We may know a hypothesis/model is strictly false, but we do not yet know in which way we will find violations. Otherwise we could never learn from data. As a falsificationist, you must think we find things out from discovering our theory clashes with the facts, enough even to direct a change in your model. Even though inferences are strictly fallible, we may argue from coincidence to a genuine anomaly & even to pinpointing the source of the misfit. So I’m puzzled.

I hope that “only” will be added to the statement in the editorial to the ASA collection. Doesn’t the ASA worry that the whole effort might otherwise be discredited as anti-science?

My response:

The problem with null hypothesis significance testing is that rejection of straw-man hypothesis B is used as evidence in favor of preferred alternative A. This is a disaster. See here.

Then Mayo:

I know all this. I’ve been writing about it for donkey’s years. But that’s a testing fallacy. N-P and Fisher couldn’t have been clearer. That does not mean we learn nothing from a correct use of tests. N-P tests have a statistical alternative and at most one learns, say, about a discrepancy from a hypothesized value. If a double blind RCT clinical trial repeatedly shows a statistically significant (small p-value) increase in cancer risks among the exposed, will you deny that’s evidence?

Me:

I don’t care about the people, Neyman, Fisher, and Pearson. I care about what researchers do. They do something called NHST, and it’s a disaster, and I’m glad that Greenland and others are writing papers pointing this out.

Mayo:

We’ve been saying this for years and years. Are you saying you would no longer falsify models because some people will move from falsifying a model to their favorite alternative theory that fits the data? That’s crazy. You don’t give up on correct logic because some people use illogic. The clinical trials I’m speaking about do not commit those crimes. Would you really be willing to say that they’re all bunk because some psychology researchers do erroneous experiments and make inferences to claims where we don’t even know we’re measuring the intended phenomenon?

Ironically, by the way, the Greenland argument only weakens the possibility of finding failed replications.

Me:

I pretty much said it all here.

I don’t think clinical trials are all bunk. I think that existing methods, NHST included, can be adapted to useful purposes at times. But I think the principles underlying these methods don’t correspond to the scientific questions of interest, and I think there are lots of ways to do better.

Mayo:

And I’ve said it all many times in great detail. I say drop NHST. It was never part of any official methodology. But that is no justification for endorsing an official policy that denies we can learn from statistically significant effects in controlled clinical trials, among other legitimate probes. Why not punish the wrong-doers rather than all of science that uses statistical falsification?

Would critics of statistical significance tests use a drug that resulted in statistically significant increased risks in patients time and again? Would they recommend it to members of their family? If the answer to these questions is “no”, then they cannot at the same time deny that anything can be learned from finding statistical significance.

Me:

In those cases where NHST works, I think other methods work better. To me, the main value of significance testing is: (a) when the test doesn’t reject, that tells you your data are too noisy to reject the null model, and so it’s good to know that; (b) in some cases it serves as a convenient shorthand for a more thorough analysis; and (c) it can find flaws in models that we are interested in (as in chapter 6 of BDA). I would not use significance testing to evaluate a drug, or to prove that some psychological manipulation has a nonzero effect, or whatever, and those are the sorts of examples that keep coming up. In answer to your previous email, I don’t want to punish anyone; I just think statistical significance is a bad idea and I think we’d all be better off without it. In your example of a drug, the key phrase is “time and again.” No statistical significance is needed here.

Mayo:

One or two times would be enough if they were well controlled. And the ONLY reason they have meaning even if it were time and time again is because they are well controlled. I’m totally puzzled as to how you can falsify models using p-values & deny p-value reasoning.

As I discuss through my book, Statistical Inference as Severe Testing, the most important role of the severity requirement is to block claims—precisely the kinds of claims that get support under other methods be they likelihood or Bayesian.

Stop using NHST: there’s a speech ban I can agree with. In many cases the best way to evaluate a drug is via controlled trials. I think you forget that for me, since any claim must be well probed to be warranted, estimations can still be viewed as tests.

I will stop trading in biotechs if the rule to just report observed effects gets passed and the responsibility that went with claiming a genuinely statistically significant effect goes by the board. That said, it’s fun to be talking with you again.

Me:

I’m interested in falsifying real models, not straw-man nulls of zero effect. Regarding your example of the new drug: yes, it can be solved using confidence intervals, or z-scores, or estimates and standard errors, or p-values, or Bayesian methods, or just about anything, if the evidence is strong enough. I agree there are simple problems for which many methods work, including p-values when properly interpreted. But I don’t see the point of using hypothesis testing in those situations either—it seems to make much more sense to treat them as estimation problems: how effective is the drug, ideally for each person, or else just estimate the average effect if you’re OK fitting that simpler model.

I can blog our exchange if you’d like.

And so I did.

Please be polite in any comments. Thank you.

I was glad to see that I’d pretty much said just what I’d want to say. I might have wanted to get in the last word regarding his final remark: I think the task of distinguishing genuine from spurious effects is crucial. If you start out thinking you’re “estimating” something when it could readily have been exposed as noise, you will be led astray. The only confusion in what I’d said might be regarding the term “NHST”. On this, see the comments to this post and my “Farewell Keepsake” from SIST (2018, CUP).
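That closing point, that an estimate which could readily have been noise will lead you astray, can be illustrated with a small simulation. This is only a sketch with invented numbers (the true effect, standard error, and threshold below are assumptions for illustration, not anything from the exchange): when a real effect is small and a study is noisy, the estimates that happen to cross the p < 0.05 line systematically exaggerate it.

```python
# Hypothetical simulation (numbers invented for illustration): with a small
# true effect and a noisy study, the estimates that happen to reach
# "statistical significance" (two-sided p < 0.05) exaggerate the true effect.
import random

random.seed(1)

true_effect = 0.1   # assumed small real effect
se = 0.5            # assumed large standard error: a noisy study
n_sims = 100_000

significant = []
for _ in range(n_sims):
    estimate = random.gauss(true_effect, se)  # one study's point estimate
    if abs(estimate / se) > 1.96:             # "significant" at p < 0.05
        significant.append(estimate)

share = len(significant) / n_sims
exaggeration = sum(abs(e) for e in significant) / len(significant) / true_effect
print(f"share of studies reaching significance: {share:.3f}")
print(f"average |significant estimate| is {exaggeration:.1f}x the true effect")
```

In this setup the significant estimates average roughly ten times the true effect, which is the sense in which treating a noisy “significant” result as a good estimate leads one astray.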