The necessity—and the difficulty—of admitting failure in research and clinical practice

June 10, 2018

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Bill Jefferys sends along this excellent newspaper article by Siddhartha Mukherjee, “A failure to heal,” about the necessity—and the difficulty—of admitting failure in research and clinical practice. Mukherjee writes:

What happens when a clinical trial fails? This year, the Food and Drug Administration approved some 40 new medicines to treat human illnesses, including 13 for cancer, three for heart and blood diseases and one for Parkinson’s. . . . Yet the vastly more common experience in the life of a clinical scientist is failure: A pivotal trial does not meet its expected outcome. What happens then? . . .

The first thing you feel when a trial fails is a sense of shame. You’ve let your patients down. You know, of course, that experimental drugs have a poor track record — but even so, this drug had seemed so promising (you cannot erase the image of the cancer cells dying under the microscope). You feel as if you’ve shortchanged the Hippocratic oath. . . .

There’s also a more existential shame. In an era when Big Pharma might have macerated the last drips of wonder out of us, it’s worth reiterating the fact: Medicines are notoriously hard to discover. The cosmos yields human drugs rarely and begrudgingly — and when a promising candidate fails to work, it is as if yet another chemical morsel of the universe has been thrown into the Dumpster. The meniscus of disappointment rises inside you . . .

And then a second instinct takes over: Why not try to find the people for whom the drug did work? . . . This kind of search-and-rescue mission is called “post hoc” analysis. It’s exhilarating — and dangerous. . . . The reasoning is fatally circular — a just-so story. You go hunting for groups of patients that happened to respond — and then you turn around and claim that the drug “worked” on, um, those very patients that you found. (It’s quite different if the subgroups are defined before the trial. There’s still the statistical danger of overparsing the groups, but the reasoning is fundamentally less circular.) . . .

Perhaps the most stinging reminder of these pitfalls comes from a timeless paper published by the statistician Richard Peto. In 1988, Peto and colleagues had finished an enormous randomized trial on 17,000 patients that proved the benefit of aspirin after a heart attack. The Lancet agreed to publish the data, but with a catch: The editors wanted to determine which patients had benefited the most. Older or younger subjects? Men or women?

Peto, a statistical rigorist, refused — such analyses would inevitably lead to artifactual conclusions — but the editors persisted, declining to advance the paper otherwise. Peto sent the paper back, but with a prank buried inside. The clinical subgroups were there, as requested — but he had inserted an additional one: “The patients were subdivided into 12 … groups according to their medieval astrological birth signs.” When the tongue-in-cheek zodiac subgroups were analyzed, Geminis and Libras were found to have no benefit from aspirin, but the drug “produced halving of risk if you were born under Capricorn.” Peto now insisted that the “astrological subgroups” also be included in the paper — in part to serve as a moral lesson for posterity.

I actually disagree with Peto—not necessarily for that particular study, but considering the subgroup problem more generally. I mean, sure, I agree that raw comparisons can be noisy, but with a multilevel model it should be possible to study lots of comparisons and just partially pool these toward zero.
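To make the partial-pooling idea concrete, here is a minimal sketch in Python. It is not a full multilevel model: it assumes a simple normal-normal setup in which each subgroup's estimated difference has a known standard error and the true differences are centered at zero, and it uses a crude method-of-moments estimate of the between-group variance. The function name `partial_pool_toward_zero`, the 12 simulated "zodiac" groups, and the standard errors are all hypothetical choices for illustration.

```python
import random


def partial_pool_toward_zero(estimates, std_errors):
    """Shrink noisy subgroup estimates toward zero (normal-normal sketch)."""
    n = len(estimates)
    # Method-of-moments estimate of the between-group variance tau^2:
    # if true effects are centered at zero, E[y_j^2] = tau^2 + se_j^2.
    tau2 = max(0.0, sum(y * y - s * s for y, s in zip(estimates, std_errors)) / n)
    # Each estimate is multiplied by tau^2 / (tau^2 + se_j^2), a factor in [0, 1).
    pooled = [y * tau2 / (tau2 + s * s) for y, s in zip(estimates, std_errors)]
    return pooled, tau2


# Hypothetical data: 12 subgroups whose true differences are all exactly zero,
# so any apparent subgroup "effect" in the raw estimates is pure noise.
rng = random.Random(2018)
std_errors = [1.0] * 12
raw = [rng.gauss(0.0, s) for s in std_errors]

pooled, tau2 = partial_pool_toward_zero(raw, std_errors)

print(f"estimated between-group variance: {tau2:.3f}")
for j, (r, p) in enumerate(zip(raw, pooled), start=1):
    print(f"group {j:2d}: raw {r:+.2f} -> partially pooled {p:+.2f}")
```

Every pooled estimate is closer to zero than its raw counterpart, and when the raw estimates vary no more than their standard errors would predict, the estimated between-group variance is zero and everything is pooled all the way. A real analysis would fit the multilevel model directly and carry the uncertainty in the variance estimate through to the conclusions.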

That said, I agree with the author’s larger point that it would be good if researchers could just admit that sometimes an experiment is just a failure, that their hypothesis didn’t work and it’s time to move on.

I recently encountered an example in political science where the researcher had a preregistered hypothesis, did the experiment, and the result was in the wrong direction and not statistically significant: a classic case of a null finding. But the researcher didn’t give up, instead reporting that the result was statistically significant at the 10% level, explaining that a result in the wrong direction was also consistent with theory, and reporting some interactions as well. That’s a case where the appropriate multilevel model would’ve partially pooled everything toward zero; alternatively, Peto’s just-give-up strategy would’ve been fine too. Not giving up, but being clear that your claims are not strongly supported by the data, would also be OK. But it was not OK to claim strong evidence in this case; that’s a case of people using statistical methods to fool themselves.

To return to Mukherjee’s article:

Why do we do it then? Why do we persist in parsing a dead study — “data dredging,” as it’s pejoratively known? One answer — unpleasant but real — is that pharmaceutical companies want to put a positive spin on their drugs, even when the trials fail to show benefit. . . .

The less cynical answer is that we genuinely want to understand why a medicine doesn’t work. Perhaps, we reason, the analysis will yield an insight on how to mount a second study — this time focusing the treatment on, say, just men over 60 who carry a genetic marker. We try to make sense of the biology: Maybe the drug was uniquely metabolized in those men, or maybe some physiological feature of elderly patients made them particularly susceptible.

Occasionally, this dredging will indeed lead to a successful follow-up trial (in the case of O, there’s now a new study focused on the sickest patients). But sometimes, as Peto reminds us, we’ll end up chasing mirages . . .

I think Mukherjee’s right: it’s not all about cynicism. Researchers really do believe. The trouble is that raw estimates selected on statistical significance are biased (see section 2.1 of this paper). To put it another way: if you have the longer-term goal of finding interesting avenues to pursue for future research, that’s great, and the way to do this is not to hunt for “statistically significant” differences in your data, but rather to model the entire pattern of your results. Running your data through a statistical significance filter is just a way to add noise.
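The bias from the significance filter is easy to see in a small simulation. The sketch below is an illustration of the general point, not a reproduction of the analysis in the paper cited above. It assumes a hypothetical true effect of 0.5 measured with a standard error of 1.0, replicates the study many times, and compares all the estimates with the subset that clears the conventional 5% threshold. The function name and all numerical settings are assumptions made for the example.

```python
import random
import statistics


def significance_filter(true_effect, std_error, n_sims=100_000, seed=1):
    """Simulate replicated studies and summarize what survives p < 0.05."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_sims)]
    # Keep only estimates more than 1.96 standard errors from zero.
    significant = [e for e in estimates if abs(e) > 1.96 * std_error]
    wrong_sign = [e for e in significant if e * true_effect < 0]
    return {
        "mean_all": statistics.fmean(estimates),
        "mean_abs_significant": statistics.fmean([abs(e) for e in significant]),
        "share_significant": len(significant) / n_sims,
        "share_wrong_sign": len(wrong_sign) / len(significant),
    }


# Hypothetical setting: a small true effect measured noisily.
result = significance_filter(true_effect=0.5, std_error=1.0)

print(f"mean of all estimates:            {result['mean_all']:.2f}")
print(f"mean |estimate| when significant: {result['mean_abs_significant']:.2f}")
print(f"share reaching significance:      {result['share_significant']:.3f}")
print(f"share of those with wrong sign:   {result['share_wrong_sign']:.3f}")
```

The unfiltered estimates average out to roughly the true effect, but any estimate that passes the filter must exceed 1.96 in absolute value, so the "significant" results overstate an effect of 0.5 by a factor of at least four, and some of them have the wrong sign.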

