(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

David Austin pointed me to this article by Leah Jager and Jeffrey Leek. The title is funny but the article is serious:

The accuracy of published medical research is critical both for scientists, physicians and patients who rely on these results. But the fundamental belief in the medical literature was called into serious question by a paper suggesting most published medical research is false. Here we adapt estimation methods from the genomics community to the problem of estimating the rate of false positives in the medical literature using reported P‐values as the data. We then collect P‐values from the abstracts of all 77,430 papers published in The Lancet, The Journal of the American Medical Association, The New England Journal of Medicine, The British Medical Journal, and The American Journal of Epidemiology between 2000 and 2010. We estimate that the overall rate of false positives among reported results is 14% (s.d. 1%), contrary to previous claims. We also find there is not a significant increase in the estimated rate of reported false positive results over time (0.5% more FP per year, P = 0.18) or with respect to journal submissions (0.1% more FP per 100 submissions, P = 0.48). Statistical analysis must allow for false positives in order to make claims on the basis of noisy data. But our analysis suggests that the medical literature remains a reliable record of scientific progress.

Jager and Leek may well be correct in their larger point, that the medical literature is broadly correct. But I don’t think the statistical framework they are using is appropriate for the questions they are asking. My biggest problem is the identification of *scientific* hypotheses with *statistical* “hypotheses” of the “theta = 0” variety.

Here’s what I think is going on. Medical researchers are mostly studying real effects (certain wacky examples aside). But there’s a lot of variation. A new treatment will help in some cases and hurt in others. Also, studies are not perfect: various sorts of measurement error and selection bias creep in, so even the occasional truly zero effect will not be zero in statistical expectation (i.e., with a large enough study, effects will be found). Nonetheless, there is such a thing as an error. It’s not a type 1 or type 2 error in the classical sense (and as considered by Jager and Leek); rather, there are Type S errors (someone says an effect is positive when it’s actually negative) and Type M errors (someone says an effect is large when it’s actually small, or vice versa). For example, the notorious study of beauty and sex ratios was a Type M error: the claim was an 8 percentage point difference in the probability of a girl (comparing the children of beautiful and non-beautiful parents), but I’m pretty sure any actual difference is 0.3 percentage points or less, it could go in either direction, and there’s no reason to suppose it persists over time. The point in that example is not that the true effect is or is not zero (thus making the original claim “false” or “true”) but rather that the study is noninformative. If it got the sign right it’s by luck, and in any case it’s overestimating the magnitude of any difference by more than an order of magnitude.
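To make the Type S and Type M idea concrete, here is a minimal simulation sketch. It assumes a true difference of 0.3 percentage points (the upper bound mentioned above) and a standard error of 3 percentage points; that standard error is a hypothetical value chosen for illustration, not a figure from the study. The function name `type_s_m` and its parameters are likewise made up for this example. It repeatedly draws a noisy estimate, keeps only the "statistically significant" ones, and asks how often those have the wrong sign and by how much they overstate the true effect.

```python
import random
import statistics

def type_s_m(true_effect, se, n_sims=200_000, z_crit=1.96, seed=1):
    """Simulate replications of a study and summarize the significant ones.

    Returns (power, type_s_rate, exaggeration_ratio), where the last two
    are computed only over estimates that reach |estimate| > z_crit * se.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        est = rng.gauss(true_effect, se)      # one noisy estimate
        if abs(est) > z_crit * se:            # passes the usual 5% test
            significant.append(est)
    power = len(significant) / n_sims
    # Type S: a significant estimate whose sign disagrees with the truth
    wrong_sign = sum(1 for e in significant if e * true_effect < 0)
    type_s_rate = wrong_sign / len(significant)
    # Type M: average magnitude of significant estimates relative to the truth
    mean_abs = statistics.fmean([abs(e) for e in significant])
    exaggeration_ratio = mean_abs / abs(true_effect)
    return power, type_s_rate, exaggeration_ratio

# Assumed inputs, in percentage points: 0.3 true effect, 3.0 standard error
power, type_s, exaggeration = type_s_m(true_effect=0.3, se=3.0)
print(f"power: {power:.3f}")
print(f"Type S rate among significant results: {type_s:.3f}")
print(f"Type M exaggeration ratio: {exaggeration:.1f}")
```

Under these assumed numbers, a significant result is barely more likely than under a pure null, it has the wrong sign a substantial fraction of the time, and it overstates the magnitude by more than an order of magnitude. None of that depends on whether the true effect is exactly zero.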

Yes, I recognize that my own impressions may be too strongly influenced by my own experiences (very non-statistical of me); nonetheless, I see this whole false-positive, true-positive framework as a dead end.

Now to the details of the paper. Based on the word “empirical” in the title, I thought the authors were going to look at a large number of papers with p-values and then follow up and see if the claims were replicated. But no, they don’t follow up on the studies at all! What they seem to be doing is collecting a set of published p-values and then fitting a mixture model to this distribution, a mixture of a uniform distribution (for null effects) and a beta distribution (for non-null effects). Since only statistically significant p-values are typically reported, they fit their model restricted to p-values less than 0.05. But this all assumes that the p-values have this stated distribution. You don’t have to be Uri Simonsohn to know that there’s a lot of p-hacking going on. Also, as noted above, the problem isn’t really effects that are exactly zero, the problem is that a lot of effects are lost in the noise and are essentially undetectable given the way they are studied.
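For readers who want to see what fitting such a mixture involves, here is a rough sketch; it is not Jager and Leek's code. It makes two simplifying assumptions of my own: the non-null component is a Beta(a, 1) distribution (so its truncated CDF has a closed form), and the fit is a crude grid search rather than a proper optimizer. The names `simulate_pvalues`, `log_likelihood`, and `fit_mixture` are hypothetical. The p-values are simulated from the model itself, so the fit recovers the null fraction only because the data were built to satisfy the assumed distribution.

```python
import math
import random

ALPHA = 0.05  # only p-values below this threshold are modeled

def simulate_pvalues(n, pi0, a, seed=0):
    """Draw n p-values from the assumed truncated mixture on (0, ALPHA]."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(n):
        u = 1.0 - rng.random()                    # uniform on (0, 1], never zero
        if rng.random() < pi0:
            pvals.append(ALPHA * u)               # null: uniform on (0, ALPHA]
        else:
            pvals.append(ALPHA * u ** (1.0 / a))  # non-null: truncated Beta(a, 1)
    return pvals

def log_likelihood(log_p, pi0, a):
    """Log-likelihood of pi0 * Uniform + (1 - pi0) * Beta(a, 1), both truncated."""
    null_density = pi0 / ALPHA
    scale = (1.0 - pi0) * a / ALPHA ** a
    # Truncated Beta(a, 1) density is a * p**(a - 1) / ALPHA**a on (0, ALPHA]
    return sum(math.log(null_density + scale * math.exp((a - 1.0) * lp))
               for lp in log_p)

def fit_mixture(pvals):
    """Grid-search maximum likelihood for (pi0, a)."""
    log_p = [math.log(p) for p in pvals]
    pi0_grid = [i / 20 for i in range(21)]     # 0.00, 0.05, ..., 1.00
    a_grid = [i / 20 for i in range(1, 20)]    # 0.05, 0.10, ..., 0.95
    best = max((log_likelihood(log_p, pi0, a), pi0, a)
               for pi0 in pi0_grid for a in a_grid)
    return best[1], best[2]

# Simulated data with half of the significant p-values coming from nulls
pvals = simulate_pvalues(5000, pi0=0.5, a=0.2)
pi0_hat, a_hat = fit_mixture(pvals)
print(f"estimated null fraction: {pi0_hat:.2f}, estimated beta shape: {a_hat:.2f}")
```

The estimate of `pi0` comes entirely from the shape of the p-value distribution below 0.05. If the real p-values are shaped by selection or p-hacking instead of this mixture, the same procedure still returns a number, but there is no reason to read it as a false positive rate.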

Jager and Leek write that their model is commonly used to study hypotheses in genetics and imaging. I could see how this model could make sense in those fields: First, at least in genetics I could imagine a very sharp division between a small number of nonzero effects and a large number of effects that are essentially null. Second, in these fields, a researcher is analyzing a big data dump and gets to see all the estimates and all the p-values at once, so at that initial stage there is no p-hacking or selection bias. But I don’t see this model applying to published medical research, for two reasons. First, as noted above, I don’t think there would be a sharp division between null and non-null effects; and, second, there’s just too much selection going on for me to believe that the conditional distributions of the p-values would be anything like the theoretical distributions suggested by Neyman-Pearson theory.

So, no, I don’t at all believe Jager and Leek when they write, “we are able to empirically estimate the rate of false positives in the medical literature and trends in false positive rates over time.” They’re doing this by basically assuming the model that is being questioned, the textbook model in which effects are pure and in which there is no p-hacking.

I hate to be so negative; they have a clever idea and I think they mean well. But I think this sort of analysis reveals little more than the problems that arise when you take statistical jargon such as “hypothesis” too seriously.

P.S. Jager and Leek note that they’ve put all their data online so that others can do their own analyses. Also see Leek’s reply in comments.

P.P.S. More from Leek. To respond briefly to Leek’s comments: (1) No, my point about Type 1 errors is not primarily “semantics” or “philosophy.” I agree with Leek that his framework is clear—my problem is that I don’t think it applies well to reality. As noted above, I don’t think his statistical model of hypotheses corresponds to actual scientific hypotheses in general. (2) When I remarked that Jager and Leek did not follow up the published studies to see which were true (however “true” is defined), I was criticizing their claim to be “empirical.” They write, “we are able to empirically estimate the rate of false positives in the medical literature and trends in false positive rates over time”—but I don’t see this as an empirical estimate at all, I see it as almost entirely model-based. To me, an empirical estimate of the rate of false positives would use empirical data on positives and negatives. (3) Leek does some new simulation studies. That seems like a good direction to pursue.

P.P.P.S. Just to clarify: I think what Jager and Leek are trying to do is hopeless. So it’s not a matter of them doing it wrong, I just don’t think it’s possible to analyze a collection of published p-values and, from that alone, infer anything interesting about the distribution of true effects. It’s just too assumption-driven. You’re basically trying to learn things from the shape of the distribution, and to get anywhere you have to make really strong, inherently implausible assumptions. These estimates just can’t be “empirical” in any real sense of the word. It’s fine to do some simulations and see what pops up, but I think it’s silly to claim that this has any direct bearing on claims of scientific truth or progress.
