What’s the p-value good for: I answer some questions.

Martin King writes:

For a couple of decades (from about 1988 to 2006) I was employed as a support statistician, and became very interested in the p-value issue; hence my interest in your contribution to this debate. (I am not familiar with the p-value ‘reconciliation’ literature, as published after about 2005.) I would hugely appreciate it, if you might find the time to comment further on some of the questions listed in this document.

I would be particularly interested in learning more about your views on strict Neyman-Pearson hypothesis testing, based on critical values (critical regions), given an insistence on power calculations among research funding organisations (i.e., first section headed ‘p-value thresholds’), and the long-standing recommendation that biomedical researchers should focus on confidence intervals instead of p-values (i.e., penultimate section headed ‘estimation and confidence intervals’).

Here are some excerpts from King’s document that I will respond to:

My main question is about ‘dichotomous thinking’ and p-value thresholds. McShane and Gal (2017, page 888) refers to “dichotomous thinking and similar errors”. Is it correct to say that dichotomous thinking is an error? . . .

If funding bodies insist on strict hypothesis testing (otherwise why the insistence on power analysis, as opposed to some other assessment of adequate precision), is it fair to criticise researchers for obeying the rules dictated by the method? In summary, before banning p-value thresholds, do you have to persuade the funding bodies to abandon their insistence on power calculations, and allow applicants more flexibility in showing that a proposed study has sufficient precision? . . .

This brings us to the second question regarding what should be taught in statistics courses aimed at biomedical researchers. A teacher might want the freedom to design courses that assume an ideal world in which statisticians and researchers are free to adopt a rational approach of their choice. Thus, a teacher might decide to drop frequentist methods (if she/he regards frequentist statistics as nonsense) and focus on the alternatives. But doesn’t this create a problem for the course recipients, if grant-awarding bodies and journal editors insist on frequentist statistics? . . .

It is suggested (McShane et al. 2018) that researchers often fail to provide sufficient information on currently subordinate factors. I spent many years working in an experimental biomedical environment, and it is my impression that most experimental biomedical researchers do present this kind of information. (They do not spend time doing experiments that are not expected to work or collecting data that are not expected to yield useful and substantial information. It is my impression that some authors go to the extreme in attempting to present an argument for relevance and plausibility.) Do you have a specific literature in mind where it is common to see results offered with no regard for motivation, relevance, mechanism, plausibility etc. (apart from data dredging/data mining studies in which mechanism and plausibility might be elusive)? . . .

For many years it had not occurred to me that there is a distinction between looking at p-values (or any other measure of evidence) obtained as a participant in a research study, versus looking at third-party results given in some publication, because the latter have been through several unknown filters (researcher selection, the significance filter, etc.). Although others had commented on this problem, it was your discussions of the significance filter that prompted me to fully realise the importance of this issue. Is it a fact that there is no mechanism by which readers can evaluate the strength of evidence in many published studies? I realise that pre-registration has been proposed as a partial solution to this problem. But it is my impression that, of necessity, much experimental and basic biomedical science research takes the form of an iterative and adaptive learning process, as outlined by Box and Tiao (pages 4-5), for example. I assume that many would find it difficult to see how pre-registration (with constant revision) would work in this context, without imposing a massive obstacle to making progress.

And now my response:

1. Yes, I think dichotomous frameworks are usually a mistake in science. With rare exceptions, I don’t think it makes sense to say that an effect is there or not there. Instead I’d say that effects vary.

Sometimes we don’t have enough data to distinguish an effect from zero, and that can be a useful thing to say. Reporting that an effect is not statistically significant can be informative, but I don’t think it should be taken as an indication that the true effect is zero; it just tells us that our data and model do not give us enough precision to distinguish the effect from zero.
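To make this concrete, here is a small simulation sketch (my illustration, with invented numbers: a true effect of 0.2 standard deviations and 50 subjects per group). Most replications of this hypothetical study come out “not statistically significant,” even though the true effect is not zero.

```python
import numpy as np
from scipy import stats

# Illustrative only: a real but small effect with a small sample.
rng = np.random.default_rng(0)
true_effect = 0.2   # hypothetical effect size, in sd units
n = 50              # subjects per group

n_sims = 10_000
significant = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    _, p = stats.ttest_ind(treated, control)
    significant += (p < 0.05)

# Power is only about 17%: most replications would be reported as
# "not significant" even though the true effect is not zero.
print(f"share of simulations with p < 0.05: {significant / n_sims:.2f}")
```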

2. Sometimes decisions have to be made. That’s fine. But then I think the decisions should be made based on estimated costs, benefits, and probabilities—not based on the tail-area probability with respect to a straw-man null hypothesis.
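As a toy illustration of what that could look like (all numbers here are invented, and this is only a sketch of the general idea): the decision depends on the probabilities attached to different effect sizes and on the costs and benefits, not on whether some estimate crosses a significance threshold.

```python
# Toy decision analysis with invented numbers: adopt a treatment if its
# expected net benefit per patient is positive, rather than asking whether
# the estimate reached p < 0.05.
effect_probs = {0.0: 0.5, 1.0: 0.3, 2.0: 0.2}  # hypothetical probabilities of effect sizes
benefit_per_unit_effect = 100.0                # benefit per unit of effect, per patient
cost_per_patient = 60.0                        # cost of adopting the treatment

expected_benefit = sum(p * effect * benefit_per_unit_effect
                       for effect, p in effect_probs.items())
net_benefit = expected_benefit - cost_per_patient

# Here the expected benefit is 70 and the cost is 60, so the decision is to
# adopt, even though a noisy study of this effect might well be "not significant."
print(f"expected net benefit per patient: {net_benefit:.0f}")
```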

3. If scientists in the real world are required to do X, Y, and Z, then, yes, we should train them on how to do X, Y, and Z, but we should also explain why these actions can be counterproductive to larger goals of scientific discovery, public health, etc.

Perhaps a sports analogy will help. Suppose you’re a youth coach, and your players would like to play in an adult league that uses what you consider to be poor strategies. Short term, you need to teach your players these poor strategies so they can enter the league on the league’s terms. But you should also teach them the strategies that will ultimately be more effective so that, once they’ve established themselves, or if they happen to play with an enlightened coach, they can really shine.

4. Regarding “currently subordinate factors”: In many, many of the examples we’ve discussed over the years on this blog, published papers do not include raw data or anything close to it, and they don’t give details on what data were collected, how the data were processed, or what data were excluded. Yes, there will be lots of discussion of the motivation, relevance, mechanism, plausibility, etc., of the theories, but not much thought about data quality. Some quick examples come from the evolutionary psychology literature, where the days of peak fertility were mischaracterized or finger-length measurements were treated as a measure of testosterone. There’s often a problem that data and measurements are really noisy, and authors of published papers (a) don’t even address the point and (b) don’t seem to think it matters, under the (fallacious) reasoning that, once you have achieved statistical significance, measurement error doesn’t matter.
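Here’s a quick simulation sketch of why that reasoning fails (again with numbers I made up: a small true effect, noisy measurements, 30 subjects per group). The estimates that happen to reach p < 0.05 don’t just survive the noise; they are selected because of it, and so they systematically exaggerate the true effect.

```python
import numpy as np
from scipy import stats

# Invented numbers: small true effect, noisy measurements, small samples.
rng = np.random.default_rng(1)
true_effect = 0.1      # true difference between groups
noise_sd = 1.0         # spread of the noisy measurements
n = 30                 # subjects per group

sig_estimates = []
for _ in range(10_000):
    control = rng.normal(0.0, noise_sd, n)
    treated = rng.normal(true_effect, noise_sd, n)
    estimate = treated.mean() - control.mean()
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:
        sig_estimates.append(abs(estimate))

# The statistically significant estimates average several times the true
# effect of 0.1: significance selects the luckiest, most exaggerated draws.
print(f"mean |estimate| among significant results: {np.mean(sig_estimates):.2f}")
```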

5. Preregistration is fine for what it is, but I agree that it does not resolve issues of research quality. At best, preregistration makes it more difficult for people to make strong claims from noise (although they can still do it!), hence it provides an indirect incentive for people to gather better data and run stronger studies. But it’s just an incentive; a noisy study that is preregistered is still a noisy study.

Summary

I think that p-values and statistical significance as used in practice are a noise magnifier, and I think people would be better off reporting what they find without the need to declare statistical significance.

There are times when p-values can be useful: it can help to know that a given data + model combination is weak enough that we can’t rule out some simple null hypothesis.

I don’t think the p-value is a good measure of the strength of evidence for some claim, and for several reasons I don’t think it makes sense to compare p-values. But the p-value can make sense as one piece of evidence in a larger argument about data quality.
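One way to see why comparing p-values is treacherous (a worked example with numbers I’ve chosen for illustration): an estimate of 25 with standard error 10 gives p ≈ 0.01, an estimate of 10 with standard error 10 gives p ≈ 0.32, yet the difference between the two estimates is itself nowhere near statistically significant. The difference between “significant” and “not significant” is not itself statistically significant.

```python
import numpy as np
from scipy import stats

def two_sided_p(estimate, se):
    """Two-sided normal-theory p-value for a null of zero."""
    return 2 * stats.norm.sf(abs(estimate / se))

# Illustrative numbers: two studies with the same standard error.
est_a, se_a = 25.0, 10.0   # "significant"
est_b, se_b = 10.0, 10.0   # "not significant"

diff = est_a - est_b
se_diff = np.sqrt(se_a**2 + se_b**2)   # standard error of the difference, about 14.1

print(f"p for study A:        {two_sided_p(est_a, se_a):.3f}")    # ~0.012
print(f"p for study B:        {two_sided_p(est_b, se_b):.3f}")    # ~0.317
print(f"p for the difference: {two_sided_p(diff, se_diff):.3f}")  # ~0.289
```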

Finally, the above comments apply not just to p-values but to any method used for null hypothesis significance testing.