How post-hoc power calculation is like a shit sandwich

Damn. This story makes me so frustrated I can’t even laugh. I can only cry.

Here’s the background. A few months ago, Aleksi Reito (who sent me the adorable picture above) pointed me to a short article by Yanik Bababekov, Sahael Stapleton, Jessica Mueller, Zhi Fong, and David Chang in Annals of Surgery, “A Proposal to Mitigate the Consequences of Type 2 Error in Surgical Science,” which contained some reasonable ideas but also made a common and important statistical mistake.

I was bothered to see this mistake in an influential publication. Instead of blogging it, this time I decided to write a letter to the journal, which they pretty much published as is.

My letter went like this:

An article recently published in the Annals of Surgery states: “as 80% power is difficult to achieve in surgical studies, we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%—with the given sample size and effect size observed in that study”. This would be a bad idea. The problem is that the (estimated) effect size observed in a study is noisy, especially so in the sorts of studies discussed by the authors. Using estimated effect size can give a terrible estimate of power, and in many cases can lead to drastic overestimates of power . . . The problem is well known in the statistical and medical literatures . . . That said, I agree with much of the content of [Bababekov et al.] . . . I appreciate the concerns of [Bababekov et al.] and I agree with their goals and general recommendations, including their conclusion that “we need to begin to convey the uncertainty associated with our studies so that patients and providers can be empowered to make appropriate decisions.” There is just a problem with their recommendation to calculate power using observed effect sizes.

I was surgically precise, focusing on the specific technical error in their paper and separating this from their other recommendations.

And the letter was published, with no hassle! Not at all like my frustrating experience with the American Sociological Review.

So I thought the story was over.

But then my blissful slumber was interrupted when I received another email from Reito, pointing to a response in that same journal by Bababekov and Chang to my letter and others. Bababekov and Chang write:

We are greatly appreciative of the commentaries regarding our recent editorial . . .

So far, so good! But then:

We respectfully disagree that it is wrong to report post hoc power in the surgical literature. We fully understand that P value and post hoc power based on observed effect size are mathematically redundant; however, we would point out that being redundant is not the same as being incorrect. . . . We also respectfully disagree that knowing the power after the fact is not useful in surgical science.

No! My problem is not that their recommended post-hoc power calculations are “mathematically redundant”; my problem is that their recommended calculations will give wrong answers because they are based on extremely noisy estimates of effect size. To put it in statistical terms, their recommended method has bad frequency properties.
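To see what I mean by bad frequency properties, here’s a quick simulation sketch (my own illustration, assuming a simple normal sampling model with known standard error and a two-sided 5% test). The true power is set to a modest value; the “post-hoc power” plugged in from each simulated estimate is then all over the map, and it’s an overestimate in exactly the replications that reach significance:

```python
# Quick simulation sketch: normal sampling model with known standard error,
# two-sided 5% test. The true effect is set one standard error away from zero,
# so the true power is modest; "post-hoc power" is then computed from each
# simulated estimate, the way the editorial recommends.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
se = 1.0                  # work in units of the standard error
true_effect = 1.0 * se    # illustrative choice: true effect one se from zero
z_crit = norm.ppf(0.975)  # 1.96

def power_at(effect_in_ses):
    """Two-sided power under the normal approximation."""
    return norm.sf(z_crit - effect_in_ses) + norm.cdf(-z_crit - effect_in_ses)

true_power = power_at(true_effect / se)                 # about 0.17

estimates = rng.normal(true_effect, se, size=100_000)   # many replications of the study
post_hoc_power = power_at(np.abs(estimates) / se)       # plug each estimate into the power formula
significant = np.abs(estimates) > z_crit * se

print(f"true power: {true_power:.2f}")
print("post-hoc power, 2.5%/50%/97.5% quantiles:",
      np.round(np.quantile(post_hoc_power, [0.025, 0.5, 0.975]), 2))
print(f"mean post-hoc power among significant replications: {post_hoc_power[significant].mean():.2f}")
```

With a true power of about 17%, the plugged-in number ranges from roughly 5% to roughly 85% across replications, and it’s above 50% by construction whenever the result happens to be statistically significant. Those are the bad frequency properties.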

I completely agree with the authors that “knowing the power after the fact” can be useful, both in designing future studies and in interpreting existing results. John Carlin and I discuss this in our paper. But the authors’ recommended procedure of taking a noisy estimate and plugging it into a formula does not give us “the power”; it gives us a very noisy estimate of the power. Not the same thing at all.

Here’s an example. Suppose you have 200 patients: 100 treated and 100 control, and post-operative survival is 94% in the treated group and 90% in the control group. Then the raw estimated treatment effect is 0.04 with standard error sqrt(0.94*0.06/100 + 0.90*0.10/100) ≈ 0.04. The estimate is just one s.e. away from zero, hence not statistically significant. And the crudely estimated post-hoc power, using the normal distribution, is approximately 16% (the probability of observing an estimate at least 2 standard errors away from zero, conditional on the true parameter value being 1 standard error away from zero). But that’s a noisy, noisy estimate! Consider that effect sizes consistent with these data could be anywhere from -0.04 to +0.12 (roughly), hence absolute effect sizes could be roughly between 0 and 3 standard errors away from zero, corresponding to power being somewhere between 5% (if the true population effect size happened to be zero) and roughly 84% (if the true effect size were three standard errors from zero). That’s what I call noisy.
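If you want to check that arithmetic, here’s a minimal sketch of the calculation (same normal approximation, same cutoff of 2 standard errors as above):

```python
# Minimal check of the arithmetic above: normal approximation, with the
# significance cutoff taken at 2 standard errors as in the example.
import numpy as np
from scipy.stats import norm

p_treated, p_control, n = 0.94, 0.90, 100
estimate = p_treated - p_control
se = np.sqrt(p_treated * (1 - p_treated) / n + p_control * (1 - p_control) / n)
print(f"estimate = {estimate:.3f}, standard error = {se:.3f}")   # 0.040 and 0.038

def power_at(effect_in_ses, cutoff=2.0):
    """Prob. of an estimate at least `cutoff` se's from zero, given the true effect (in se's)."""
    return norm.sf(cutoff - effect_in_ses) + norm.cdf(-cutoff - effect_in_ses)

print(f"post-hoc power at the observed estimate (1 se from zero): {power_at(1):.2f}")  # 0.16
print(f"power if the true effect were zero:                       {power_at(0):.2f}")  # 0.05
print(f"power if the true effect were 3 se's from zero:           {power_at(3):.2f}")  # 0.84
```

The spread from about 5% to about 84%, depending on an effect size we simply don’t know, is the point.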

Here’s an analogy that might help. Suppose someone offers me a shit sandwich. I’m not gonna want to eat it. My problem is not that it’s a sandwich, it’s that it’s filled with shit. Give me a sandwich with something edible inside; then we can talk.

I’m not saying that the approach that Carlin and I recommend—performing design analysis using substantively-based effect size estimates—is trivial to implement. As Bababekov and Chang write in their letter, “it would be difficult to adapt previously reported effect sizes to comparative research involving a surgical innovation that has never been tested.”

Fair enough. It’s not easy, and it requires assumptions. But that’s the way it works: if you want to make a statement about the power of a study, you need to make some assumption about effect size. Make your assumption clearly, and go from there. Bababekov and Chang write: “As such, if we want to encourage the reporting of power, then we are obliged to use observed effect size in a post hoc fashion.” No, no, and no. You are not obliged to use a super-noisy estimate. You were allowed to use scientific judgment when performing that power analysis you wrote for your grant proposal, before doing the study, and you’re allowed to use scientific judgment when doing your design analysis, after doing the study.
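For concreteness, here’s a rough sketch of that kind of design analysis, done by simulation under a normal model. The assumed effect size below (0.02) is a made-up number standing in for whatever substantive knowledge you’d bring to the problem, and the standard error is the one from the example above:

```python
# Rough design-analysis sketch: pick an effect size on substantive grounds
# (0.02 here is a made-up stand-in), take the standard error implied by the
# design, and ask what the study could be expected to show. Normal model,
# two-sided 5% test, done by simulation for simplicity.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
assumed_effect = 0.02     # substantively motivated guess, NOT the observed estimate
se = 0.04                 # standard error implied by the design (as in the example above)
z_crit = norm.ppf(0.975)  # 1.96

sims = rng.normal(assumed_effect, se, size=1_000_000)   # hypothetical replications
significant = np.abs(sims) > z_crit * se

power = significant.mean()                                               # Pr(statistically significant)
type_s = (np.sign(sims[significant]) != np.sign(assumed_effect)).mean()  # wrong sign, given significance
exaggeration = np.abs(sims[significant]).mean() / assumed_effect         # Type M / exaggeration ratio

print(f"power: {power:.2f}")
print(f"Type S error rate (wrong sign, given significance): {type_s:.2f}")
print(f"exaggeration ratio (Type M): {exaggeration:.1f}")
```

With those made-up inputs, the power comes out around 8%, a statistically significant result has close to a 10% chance of having the wrong sign, and a significant estimate overstates the assumed effect by a factor of about five. That’s the sort of information a post-study design analysis can give you, without pretending that the observed estimate is the true effect size.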

The whole thing is so frustrating.

Look. I can’t get mad at the authors of this article. They’re doing their best, and they have some good points to make. They’re completely right that authors and researchers should not “misinterpret P > 0.05 to mean comparison groups are equivalent or ‘not different.’” This is an important point that’s not well understood; indeed my colleagues and I recently wrote a whole paper on the topic, actually in the context of a surgical example. Statistics is hard. The authors of this paper are surgeons, not statisticians. I’m a statistician and I don’t know anything about surgery; no reason to expect these two surgeons to know anything about statistics. But, it’s still frustrating.

P.S. After writing the above post a few months ago, I submitted it (without some features such as the “shit sandwich” line) as a letter to the editor of the journal. To its credit, the journal is publishing the letter. So that’s good.