We had some good discussion the other day following up on the article, “Retire Statistical Significance,” by Valentin Amrhein, Sander Greenland, and Blake McShane.
I have a lot to say, and it’s hard to put it all together, in part because my collaborators and I have said much of it already, in various forms.
For now I thought I’d start by listing my different thoughts in a short post while I figure out how best to organize all of this.
There’s also the problem that these discussions can easily transform into debates. After proposing an idea and seeing objections, it’s natural to then want to respond to those objections, then the responders respond, etc., and the original goals are lost.
So, before going on, some goals:
– Better statistical analyses. Learning from data in a particular study.
– Improving the flow of science. More prominence to reproducible findings, less time wasted chasing noise.
– Improving scientific practice. Changing incentives to motivate good science and demotivate junk science.
Null hypothesis testing, p-values, and statistical significance represent one approach toward attaining the above goals. I don’t think this approach works so well anymore (whether it did in the past is another question), but the point is to keep these goals in mind.
Some topics to address
1. Is this all a waste of time?
The first question to ask is, why am I writing about this at all? Paul Meehl said it all fifty years ago, and people have been rediscovering the problems with statistical-significance reasoning every decade since, for example this still-readable paper from 1985, The Religion of Statistics as Practiced in Medical Journals, by David Salsburg, which Richard Juster sent me the other day. And, even accepting the argument that the battle is still worth fighting, why don’t I just leave this in the capable hands of Amrhein, Greenland, McShane, and various others who are evidently willing to put in the effort?
The short answer is that I think I have something extra to contribute. So far, my colleagues and I have come up with some new methods and new conceptualizations—I’m thinking of type M and type S errors, the garden of forking paths, the backpack fallacy, the secret weapon, “the difference between . . .,” the use of multilevel models to resolve the multiple comparisons problem, etc. We haven’t just been standing on the street corner for the past twenty years, screaming “Down with p-values!” We’ve been reframing the problem in interesting and useful ways.
How did we make these contributions? Not out of nowhere, but as a byproduct of working on applied problems, trying to work things out from first principles, and, yes, reading blog comments and answering questions from randos on the internet. When John Carlin and I write an article like this or this, for example, we’re not just expressing our views clearly and spreading the good word. We’re also figuring out much of it as we go along. So, when I see misunderstanding about statistics and try to clean it up, I’m learning too.
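The type M (magnitude) and type S (sign) errors mentioned above can be illustrated with a minimal simulation. The numbers here are illustrative assumptions, not taken from any particular study: a true effect of 2 measured with a standard error of 8, i.e., a severely underpowered design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical low-power setting (illustrative numbers, not from any real study)
true_effect = 2.0   # assumed true effect size
se = 8.0            # standard error of the estimate

# Simulate many replications of the study
estimates = rng.normal(true_effect, se, size=1_000_000)

# Keep only the "statistically significant" results (|z| > 1.96)
significant = estimates[np.abs(estimates) > 1.96 * se]

power = len(significant) / len(estimates)

# Type M error: expected exaggeration of the effect among significant results
exaggeration = np.mean(np.abs(significant)) / true_effect

# Type S error: probability that a significant result has the wrong sign
wrong_sign = np.mean(significant < 0)

print(f"power        ≈ {power:.2f}")
print(f"exaggeration ≈ {exaggeration:.1f}x")
print(f"type S error ≈ {wrong_sign:.2f}")
```

In this regime the rare significant result overstates the true effect severalfold and has a nontrivial chance of pointing in the wrong direction—which is the sense in which selecting on significance amplifies noise rather than filtering it out.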
2. Paradigmatic examples
It could be a good idea to list the different sorts of examples that are used in these discussions. Here are a few that keep coming up:
– The clinical trial comparing a new drug to the standard treatment.
– “Psychological Science” or “PNAS”-style headline-grabbing unreplicable noise mining.
– Gene-association studies.
– Regressions for causal inference from observational data.
– Studies with multiple outcomes.
– Descriptive studies such as in Red State Blue State.
I think we can come up with more of these. My point here is that different methods can work for different examples, so I think it makes sense to put a bunch of these cases in one place so the argument doesn’t jump around so much. We can also include some examples where p-values and statistical significance don’t seem to come up at all. For instance, MRP to estimate state-level opinion from national surveys: nobody’s out there testing which states are statistically significantly different from others. Another example is item-response or ideal-point modeling in psychometrics or political science: again, these are typically framed as problems of estimation, not testing.
3. Statistics and computer science as social sciences
We’re used to statistical methods being controversial, with leading statisticians throwing polemics at each other regarding issues that are both theoretically fundamental and also core practical concerns. The fighting’s been going on, in different ways, for about a hundred years!
But here’s a question. Why is it that statistics is so controversial? The math is just math, no controversy there. And the issues aren’t political, at least not in a left-right sense. Statistical controversies don’t link up in any natural way to political disputes about business and labor, or racism, or war, or whatever.
In its deep and persistent controversies, statistics looks less like the hard sciences and more like the social sciences. Which, again, seems strange to me, given that statistics is a form of engineering, or applied math.
Maybe the appropriate point of comparison here is not economics or sociology, which have deep conflicts based on human values, but rather computer science. Computer scientists can get pretty worked up about technical issues which to me seem unresolvable: the best way to structure a programming language, for example. I don’t like to label these disputes as “religious wars,” but the point is that the level of passion often seems pretty high, in comparison to the dry nature of the subject matter.
I’m not saying that passion is wrong! Existing statistical methods have done their part to slow down medical research: lives are at stake. Still, stepping back, the passion in statistical debates about p-values seems a bit more distanced from the ultimate human object of concern, compared to, say, the passion in debates about economic redistribution or racism.
To return to the point about statistics and computer science: these two fields are fundamentally about how they are used. A statistical method or a computer ultimately connects to a human: someone has to decide what to do. So both are social sciences, in a way that physics, chemistry, and biology are not, or not as much.
4. Different levels of argument
The direct argument in favor of the use of statistical significance and p-values is that it’s desirable to use statistical procedures with so-called type 1 error control. I don’t buy that argument because I think that selecting on statistical significance yields noisy conclusions. To continue the discussion further, I think it makes sense to consider particular examples, or classes of examples (see item 2 above). They talk about error control, I talk about noise, but both these concepts are abstractions, and ultimately it has to come down to reality.
There are also indirect arguments. For example: 100 million p-value users can’t be wrong. Or: Abandoning statistical significance might be a great idea, but nobody will do it. I’d prefer to have the discussion at the more direct level of what’s a better procedure to use, with the understanding that it might take a while for better options to become common practice.
5. “Statistical significance” as a lexicographic decision rule
This is discussed in detail in my article with Blake McShane, David Gal, Christian Robert, and Jennifer Tackett:
[In much of current scientific practice], statistical significance serves as a lexicographic decision rule whereby any result is first required to have a p-value that attains the 0.05 threshold and only then is consideration—often scant—given to such factors as related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain.
Traditionally, the p < 0.05 rule has been considered a safeguard against noise-chasing and thus a guarantor of replicability. However, in recent years, a series of well-publicized examples (e.g., Carney, Cuddy, and Yap 2010; Bem 2011) coupled with theoretical work has made it clear that statistical significance can easily be obtained from pure noise . . . We propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with currently subordinate factors (e.g., related prior evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain) as just one among many pieces of evidence. We have no desire to “ban” p-values or other purely statistical measures. Rather, we believe that such measures should not be thresholded and that, thresholded or not, they should not take priority over the currently subordinate factors. We also argue that it seldom makes sense to calibrate evidence as a function of p-values or other purely statistical measures.
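The claim that statistical significance can easily be obtained from pure noise is straightforward to check by simulation. The sketch below is a deliberately simplified stand-in for the problem: it models the researcher degrees of freedom as 20 explicit outcome measures, whereas the garden of forking paths is subtler (the choices need not all be made on the same dataset). All numbers are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_sims, n_outcomes, n_per_group = 2_000, 20, 50
false_positive_any = 0

for _ in range(n_sims):
    # Pure noise: treatment and control are drawn from the same
    # distribution, but the researcher measures 20 different outcomes
    treat = rng.normal(size=(n_outcomes, n_per_group))
    control = rng.normal(size=(n_outcomes, n_per_group))
    pvals = stats.ttest_ind(treat, control, axis=1).pvalue
    if (pvals < 0.05).any():
        false_positive_any += 1

rate = false_positive_any / n_sims
print(f"P(at least one p < 0.05 from pure noise) ≈ {rate:.2f}")
```

With 20 independent null comparisons, the chance of at least one “significant” result is about 1 − 0.95²⁰ ≈ 0.64: more likely than not, a study of pure noise yields something to report.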
6. Confirmationist and falsificationist paradigms of science
I wrote about this a few years ago:
In confirmationist reasoning, a researcher starts with hypothesis A (for example, that the menstrual cycle is linked to sexual display), then as a way of confirming hypothesis A, the researcher comes up with null hypothesis B (for example, that there is a zero correlation between date during cycle and choice of clothing in some population). Data are found which reject B, and this is taken as evidence in support of A.
In falsificationist reasoning, it is the researcher’s actual hypothesis A that is put to the test.
It is my impression that in the vast majority of cases, “statistical significance” is used in a confirmationist way. To put it another way: the problem is not just with the p-value, it’s with the mistaken idea that falsifying a straw-man null hypothesis is evidence in favor of someone’s pet theory.
7. But what if we need to make an up-or-down decision?
This comes up a lot. I recommend accepting uncertainty, but what if it’s decision time—what to do?
How can the world function if the millions of scientific decisions currently made using statistical significance somehow have to be done another way? From that perspective, the suggestion to abandon statistical significance is like a recommendation that we all switch to eating organically fed, free-range chicken. This might be a good idea for any of us individually or in small groups, but it would just be too expensive to do on a national scale. (I don’t know if that’s true when it comes to chicken farming; I’m just making a general analogy here.)
Regarding the economics, the point that we made in section 4.4 of our paper is that decisions are not currently made in an automatic way. Papers are reviewed by hand, one at a time.
As Peter Dorman puts it:
The most important determinants of the dispositive power of statistical evidence should be its quality (research design, aptness of measurement) and diversity. “Significance” addresses neither of these. Its worst effect is that, like a magician, it distracts us from what we should be paying most attention to.
To put it another way, there are two issues here: (a) the potential benefits of an automatic screening or decision rule, and (b) using a p-value (null-hypothesis tail area probability) for such a rule. We argue against using screening rules, or at least in favor of using them much less often. But in the cases where screening rules are desired, we see no reason to use p-values for this.
8. What should we do instead?
To start with, I think many research papers would be improved if all inferences were replaced by simple estimates and standard errors, with these standard errors not used to decide whether effects should be declared real, but just to give a sense of baseline uncertainty.
As Eric Loken and I put it:
Without modern statistics, we find it unlikely that people would take seriously a claim about the general population of women, based on two survey questions asked to 100 volunteers on the internet and 24 college students. But with the p-value, a result can be declared significant and deemed worth publishing in a leading journal in psychology.
For a couple more examples, consider the two studies discussed in section 2 of this article. For both of them, nothing is gained and much is lost by passing results through the statistical significance filter.
Again, the use of standard errors and uncertainty intervals is not just significance testing in another form. The point is to use these uncertainties as a way of contextualizing estimates, not to declare things as real or not.
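As a minimal sketch of what this looks like in practice, here is reporting with estimates and standard errors and no thresholding step. The data, group names, and effect sizes are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: the same outcome measured in four small groups,
# with different (assumed) true means
groups = {name: rng.normal(loc, 1.0, size=25)
          for name, loc in zip("ABCD", [0.0, 0.1, 0.3, 0.5])}

# Report every estimate with its standard error; nothing is declared
# "real" or discarded based on crossing a threshold
for name, y in groups.items():
    est = y.mean()
    se = y.std(ddof=1) / np.sqrt(len(y))
    print(f"group {name}: estimate {est:+.2f} (se {se:.2f})")
```

The point of the standard errors here is to convey the baseline uncertainty in each estimate, not to sort the groups into “significant” and “not significant” bins.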
The next step is to recognize multiplicity in your problem. Consider this paper, which contains many analyses but not a single p-value or even a confidence interval. We are able to assess uncertainty by displaying results from multiple polls. Yes, it is possible to have data with no structure at all (a simple comparison with no replications), and for these I’d just display averages, variation, and uncertainties. But this is rare: such simple comparisons are typically part of a stream of results in a larger research project.
One can and should continue with multilevel models and other statistical methods that allow more systematic partial pooling of information from different sources, but the secret weapon is a good start.
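A minimal sketch of partial pooling, under assumed numbers: the same comparison replicated across eight hypothetical sites, each yielding a noisy estimate. This uses a simple moment-based empirical-Bayes shrinkage; a full multilevel model (fit with Stan, say) would estimate all these quantities jointly, but the shrinkage logic is the same.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical "secret weapon" setup: 8 sites replicate one comparison,
# each with its own true effect and a known standard error (all assumed)
true_effects = rng.normal(0.3, 0.2, size=8)
ses = np.full(8, 0.25)
estimates = rng.normal(true_effects, ses)

# Partial pooling: shrink each site's estimate toward the overall mean,
# more strongly when between-site variation is small relative to noise
grand_mean = estimates.mean()
between_var = max(estimates.var(ddof=1) - np.mean(ses**2), 0.0)
weights = between_var / (between_var + ses**2)
pooled = grand_mean + weights * (estimates - grand_mean)

for raw, shrunk in zip(estimates, pooled):
    print(f"raw {raw:+.2f} -> partially pooled {shrunk:+.2f}")
```

Each pooled estimate sits between the site’s raw estimate and the grand mean, so the most extreme raw estimates—the ones a significance filter would select—are pulled back the hardest.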
My current plan is to write this all up as a long article, Unpacking the Statistical Significance Debate and the Replication Crisis, and put it on arXiv. That could reach people who don’t feel like engaging with blogs.
In the meantime, I’d appreciate your comments and suggestions.