(back to basics:) How is statistics relevant to scientific discovery?

Someone pointed me to this remark by psychology researcher Daniel Gilbert:

Publication is not canonization. Journals are not gospels. They are the vehicles we use to tell each other what we saw (hence “Letters” & “proceedings”). The bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech.

Which led me to this, where Gilbert approvingly quotes a biologist who wrote, “Science is doing what it always has done — failing at a reasonable rate and being corrected. Replication should never be 100%.”

I’m really happy to see this. Gilbert has been a loud defender of psychology claims based on high-noise studies (for example, the ovulation-and-clothing paper), and not long ago he was associated with the claim that “the replication rate in psychology is quite high—indeed, it is statistically indistinguishable from 100%.” This was in the context of an attack by Gilbert and others on a project in which replication studies were conducted on a large set of psychology experiments, and it was found that many of those previously published claims did not hold up under replication.

So I think this is a big step forward, that Gilbert and his colleagues are moving beyond denial to a more realistic view: one that accepts that failure is a routine, indeed inevitable, part of science, and that just because a claim is published, even in a prestigious journal, that doesn’t mean it’s correct.

Gilbert’s revised view—that the replication rate is not 100%, nor should it be—is also helpful in that, once you accept that published work, even papers by ourselves and our friends, can be wrong, this provides an opening for critics, the sort of scientists who Gilbert earlier referred to as “second stringers.”

Once you move to the view that “the bar for communicating to each other should not be high. We can decide for ourselves what to make of each other’s speech,” there is a clear role for accurate critics to move this process along. Just as good science is, ultimately, the discovery of truths that someone else would eventually have found anyway, that is, the speeding along of a process we’d hope would happen regardless, so good criticism should speed along the process of scientific correction.

Criticism, correction, and discovery all go together. Obviously discovery is the key part, otherwise there’d be nothing to criticize, indeed nothing to talk about. But, from the other direction, criticism and correction empower discovery.

If we are discouraged from criticizing published work—or if our criticism elicits pushback and attacks from the powerful, or if it’s too hard to publish criticisms and obtain data for replication—that’s bad for discovery, in three ways. First, criticizing errors allows new science to move forward in useful directions. We want science to be a sensible search, not a random walk. Second, learning what went wrong in the past can help us avoid errors in the future. That is, criticism can be methodological and can help advance research methods. Third, the potential for criticism should allow researchers to be more free in their speculation. If authors and editors felt that everything published in a top journal was gospel, there could well be too much caution in what to publish.

Just as, in economics, it is said that a social safety net gives people the freedom to start new ventures, in science the existence of a culture of robust criticism should give researchers a sense of freedom in speculation, in confidence that important mistakes will be caught.

Along with this is the attitude, which I strongly support, that there’s no shame in publishing speculative work that turns out to be wrong. We learn from our mistakes. Shame comes not when people make mistakes, but rather when they dodge criticism, won’t share their data, refuse to admit problems, and attack their critics.

But, yeah, let’s be clear: Speculation is fine, and we don’t want the (healthy) movement toward replication to create any perverse incentives that would discourage people from performing speculative research. What I’d like is for researchers to be more aware of when they’re speculating, both in their published papers and in their press materials. Not claiming a replication rate of 100%, that’s a start.

So let’s talk a bit about failed replications.

First off, as Gilbert and others have noted, an apparently failed replication might not be anything of the sort. It could be that the replication study found no statistical significance because it was too noisy; indeed, I’m not at all happy with the idea of using statistical significance, or any such binary rule, as a measure of success. Or it could be that the effect found in the original study occurs only in some situations and not others. The original and replication studies could differ in some important ways.

One thing that the replication does give you, though, is a new perspective. A couple years ago I suggested the “time-reversal heuristic” as a partial solution to the “research incumbency rule” in which people privilege the first study on a topic—even when the first study is small and uncontrolled and the second study is a large and careful replication.

In theory, an apparently failed replication can itself be a distraction, but in practice we often seem to learn a lot from these replications, for three reasons. First, the very act of performing a replication study can make us realize some of the difficulties and choices involved in the original study. This happened to us when we performed a replication of one of our own papers! Second, the failed replication casts some doubt on the original claim, which can motivate a more critical look at the original paper, which in turn can reveal all sorts of problems that nobody noticed the first time. Third, lots of papers have such serious methodological problems that their conclusions are essentially little more than shufflings of random numbers, but not everyone understands methodological criticisms, so a replication study can be a convincer. Recall the recent paper on the replication prediction market: lots of those failed replications were no surprise to educated outsiders.

What, then, is—or should be—the role of statistics, and statistical criticism in the process of scientific research?

Statistics can help researchers in three ways:
– Design and data collection
– Data analysis
– Decision making

Let’s go through each of these:

Design and data collection. Statistics can help us evaluate measures and can also give us a sense of how much accuracy we will need from our data to draw strong conclusions later on. It turns out that many statistical intuitions, developed decades ago in the context of estimating large effects with good data, do not work so well when estimating small effects with noisy data; see this article for discussion of that point.
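To make the design point concrete, here’s a minimal sketch (my own toy calculation, not from any of the papers discussed here) of how quickly sample-size requirements blow up when the effect is small relative to the noise. The two-sample setup, the 5% test, and the 80% power target are all assumptions for illustration:

```python
# Minimal sketch (not from the post): how the required sample size blows up
# when the true effect is small relative to the measurement noise.
# Assumes a two-sample comparison of means with equal group sizes,
# a two-sided 5% test, and a target of 80% power.

from scipy.stats import norm

def n_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample comparison of means."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * ((z_alpha + z_power) * sd / effect) ** 2

# A hypothetical outcome on a standardized scale (sd = 1):
for effect in [0.8, 0.5, 0.2, 0.05]:
    print(f"true effect {effect:4}: about {n_per_group(effect, sd=1):,.0f} per group")

# Roughly 25 per group for an effect of 0.8, but about 6,280 per group for an
# effect of 0.05: halving the effect quadruples the required sample size, so
# designs calibrated to "large effects, good data" can be hopeless here.
```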

Data analysis. As has been discussed many times, one of the sources of the recent replication crisis in science is the garden of forking paths: researchers gather rich data but then report only a small subset of what they found. By selecting on statistical significance, they are throwing away a lot of data and keeping what is essentially a random, noise-driven subset. The solution is to report all your data, with no selection and no arbitrary dichotomization. At that point, though, analysis becomes more difficult: analyzing a whole grid of comparisons is harder than analyzing just one simple difference. Statistical methods can come to the rescue here, in the form of multilevel models.
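As an illustration of the multilevel idea, here’s a minimal sketch (my own toy example, using the simplest normal-normal shrinkage rather than a full multilevel fit in something like Stan; all the numbers are made up) of partially pooling a grid of noisy comparisons instead of filtering them by significance:

```python
# Minimal sketch (toy example): partially pool a grid of noisy comparisons
# toward their common mean instead of filtering them by significance.
# This is the simplest normal-normal version of what a multilevel model does.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: 50 comparisons, small true effects, standard error 1.
true_effects = rng.normal(0.0, 0.2, size=50)
se = np.ones(50)
estimates = true_effects + rng.normal(0.0, se)

# Crude method-of-moments estimate of the between-comparison variance tau^2.
tau2 = max(np.var(estimates, ddof=1) - np.mean(se**2), 0.0)

# Partial pooling: shrink each noisy estimate toward the grand mean, shrinking
# more when the standard error is large relative to tau.
weight = tau2 / (tau2 + se**2)
pooled = np.mean(estimates) + weight * (estimates - np.mean(estimates))

print("rms error, raw estimates:    ", np.sqrt(np.mean((estimates - true_effects) ** 2)))
print("rms error, partially pooled: ", np.sqrt(np.mean((pooled - true_effects) ** 2)))
# The partially pooled estimates are typically much closer to the truth, and
# all 50 comparisons get reported rather than a significance-filtered subset.
```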

Decision making. One way to think about the crisis of replication is that if you make decisions based on selected statistically significant comparisons, you will overstate effect sizes. Then you have people going around making unrealistic claims, and it can take years of clean-up to dial back expectations.
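Here’s a toy simulation (again my own, with made-up numbers) of that overstatement: conditioning on statistical significance when the true effect is small and the standard error is large.

```python
# Toy simulation (made-up numbers): when the true effect is small and the
# standard error is large, the estimates that survive the significance filter
# greatly overstate the effect, and many have the wrong sign.

import numpy as np

rng = np.random.default_rng(1)

true_effect = 0.1   # hypothetical small true effect
se = 1.0            # standard error of each study's estimate
estimates = true_effect + rng.normal(0.0, se, size=100_000)

significant = np.abs(estimates) > 1.96 * se   # two-sided 5% test

print("share reaching significance:   ", significant.mean())
print("mean |estimate| if significant:", np.abs(estimates[significant]).mean())
print("share with wrong sign:         ", (estimates[significant] < 0).mean())
# Roughly 5% of these studies come out "significant"; their estimates average
# about 2.3 in absolute value (an exaggeration of more than a factor of 20),
# and nearly 40% of them point in the wrong direction.
```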

And how does statistical criticism fit into all this? Criticism of individual studies has allowed us to develop our understanding, giving us insight into designing future studies and interpreting past work.

To loop back to Daniel Gilbert’s observations quoted at the top of this post: We want to encourage scientists to play with new ideas. To this purpose, I recommend the following steps:

Reduce the costs of failed experimentation by being clearer about when research-based claims are speculative.

React openly to follow-up studies. Once you recognize that published claims can be wrong (indeed, that’s part of the process), don’t hang on to them too long or you’ll reduce your opportunities to learn.

Publish all your data and all your comparisons (you can do this using graphs so as to show many comparisons in a compact grid of plots; see the sketch below). If you follow current standard practice and focus on statistically significant comparisons, you’re losing lots of opportunities to learn.

Avoid the two-tier system. Give respect to a student project or arXiv paper just as you would to a paper published in Science or Nature.
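As a small illustration of the “compact grid of plots” suggestion above, here is a minimal matplotlib sketch; the groups, outcomes, and estimates are all made up:

```python
# Minimal sketch (hypothetical data): display every comparison in a compact
# grid of plots rather than singling out the statistically significant ones.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
groups = ["A", "B", "C", "D", "E", "F"]             # hypothetical subgroups
outcomes = ["outcome 1", "outcome 2", "outcome 3"]  # hypothetical outcomes

fig, axes = plt.subplots(len(outcomes), 1, figsize=(6, 6), sharex=True)
for ax, outcome in zip(axes, outcomes):
    est = rng.normal(0.0, 0.5, size=len(groups))    # estimated comparisons
    se = np.full(len(groups), 0.4)                  # their standard errors
    ax.errorbar(range(len(groups)), est, yerr=2 * se, fmt="o")
    ax.axhline(0, linewidth=0.5)
    ax.set_ylabel(outcome)
axes[-1].set_xticks(range(len(groups)))
axes[-1].set_xticklabels(groups)
fig.suptitle("All estimated comparisons, +/- 2 standard errors")
plt.show()
```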

One last question

Finally, what should we think about research that, ultimately, has no value, where the measurements are so noisy that nothing useful can be learned about the topics under study?

For example, there’s that research on beauty and sex ratio which we’ve discussed so many times (see here for background).

What can we get out of that doomed study?

First, it’s been a great example, allowing us to develop statistical methods for assessing what can be learned from noisy studies of small effects. Second, on this particular example, we’ve learned the negative fact that this research was a dead end. Dead ends happen; this is implied by those Gilbert quotes above. One could say that the researcher who worked on those beauty-and-sex-ratio papers did the rest of us a service by exploring this dead end so that other people don’t have to. That’s a scientific contribution too, even if it wasn’t the original aim of the study.

We should all feel free to speculate in our published papers without fear of overly negative consequences in the (likely) event that our speculations are wrong; we should all be less surprised to find that published research claims did not work out (and that’s one positive thing about the replication crisis, that there’s been much more recognition of this point); and we should all be more willing to modify and even let go of ideas that didn’t happen to work out, even if these ideas were published by ourselves and our friends.

We’ve gone a long way in this direction, both statistically and sociologically. From “the replication rate . . . is statistically indistinguishable from 100%” to “Replication should never be 100%”: this is real progress that I’m happy to see, and it gives me more confidence that we can all work together. Not agreeing on every item, I’m sure, but with a common understanding of the fallibility of published work.