Don’t define reproducibility based on p-values

April 9, 2018

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Lizzie Wolkovich writes:

I just got asked to comment on this article [“Genotypic variability enhances the reproducibility of an ecological study,” by Alexandru Milcu et al.]. I have yet to have time to fully sort out their stats, but the first thing that hit me about it was they seem to be suggesting a way to increase reproducibility is to increase some aspect that leads to important variation in the experiment (like genotypic variation in plants, which we know is important). But that doesn’t seem to make sense!

My response:

Regarding the general issue, I had a conversation with Paul Rosenbaum once about choices in design of experiments, where one can decide to perform: (a) a focused experiment with very little variation on x, which should improve precision but harm generalizability; or (b) a broader experiment in which one purposely chooses a wide range of x, which should reduce precision in estimation but allow the thing being estimated to be more relevant for out-of-sample applications. That sounds related to what’s going on here.

Regarding this particular paper, I am finding the details hard to follow, in part because they aren’t always so clear in distinguishing between data and parameters. For example, they write, “the net legume effect on mean total plant biomass varied among laboratories from 1.31 to 6.72 g dry weight (DW) per microcosm in growth chambers, suggesting that unmeasured laboratory-specific conditions outweighed effects of experimental standardization.” But I assume they are referring not to the effect but to the estimated effect, so that some of this variation could be explained as estimation error.
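To make the estimation-error point concrete, here is a minimal simulation sketch. It assumes every laboratory shares exactly the same true effect and differs only by sampling noise. The 14 laboratories match the number mentioned in the paper; the true effect (4.0), the noise standard deviation (3.0), and the sample size per arm (6) are made-up values for illustration, not numbers from the study.

```python
import random
import statistics


def simulate_lab_estimates(true_effect, noise_sd, n_per_arm, n_labs, seed=0):
    """Return one estimated treatment effect per lab.

    Every lab has the SAME true effect; estimates differ only because
    each lab measures a small, noisy sample (hypothetical setup).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_labs):
        control = [rng.gauss(0.0, noise_sd) for _ in range(n_per_arm)]
        treated = [rng.gauss(true_effect, noise_sd) for _ in range(n_per_arm)]
        # Estimated effect = difference in group means.
        estimates.append(statistics.mean(treated) - statistics.mean(control))
    return estimates


# Illustrative values only; none of these come from the paper except n_labs.
estimates = simulate_lab_estimates(true_effect=4.0, noise_sd=3.0,
                                   n_per_arm=6, n_labs=14)
print(f"smallest estimate: {min(estimates):.2f}")
print(f"largest estimate:  {max(estimates):.2f}")
print(f"sd of estimates:   {statistics.stdev(estimates):.2f}")
```

Under these assumptions the estimates spread out across labs even though there is no lab-to-lab variation in the underlying effect at all, which is why a range of estimated effects cannot be read directly as a range of true effects.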

I also find it frustrating to read a paper about replication in which decisions are made based on statistical significance; for example, see lines 174-184 of text, and, even more explicitly, on lines 187-188: “To answer the question of how many laboratories produced results that were statistically indistinguishable from one another (i.e. reproduced the same finding) . . .”

Also there are comparisons of significance and non-significance, for example this: “Introducing genotypic CSV increased reproducibility in growth chambers but not in glasshouses,” followed by post-hoc explanations: “This observation is in line with the hypothesis put forward by Richter et al. . . .”
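The trouble with comparing significance to non-significance can be shown with a small worked example. The estimates and standard errors below are hypothetical (they are not taken from the paper), the two estimates are assumed independent, and the p-values use a normal approximation.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical estimates and standard errors for two settings.
est_a, se_a = 25.0, 10.0   # "significant" on its own
est_b, se_b = 10.0, 10.0   # "not significant" on its own

# The comparison that actually matters: the difference between the two.
diff = est_a - est_b
se_diff = math.sqrt(se_a**2 + se_b**2)  # valid only if estimates are independent

for label, est, se in [("setting A", est_a, se_a),
                       ("setting B", est_b, se_b),
                       ("A minus B", diff, se_diff)]:
    z = est / se
    print(f"{label}: estimate={est:.1f}, se={se:.1f}, "
          f"z={z:.2f}, p={two_sided_p(z):.3f}")
```

In this sketch one setting clears the conventional 0.05 threshold and the other does not, yet the difference between them is well within what noise alone could produce. Saying an intervention "worked in A but not in B" on that basis is a statement about thresholds, not about the two settings differing.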

This is not to say that the claims in this paper are wrong, just that I’m finding it difficult to make sense of this paper and understand exactly what they mean by reproducibility, which is never defined in the paper.

Lizzie replied:

Yes, the theme of the paper seems to be, “When all you care about is an asterisk above your bargraph in one paper, but no asterisks when you compare papers.” They also do define reproducibility: “Because we considered that statistically significant differences among the 14 laboratories would indicate a lack of reproducibility….”

I guess what we’re saying here is that reproducibility is important, but defining it based on p-values is a mistake; it’s kinda sending you around in circles.
