Vigorous data-handling tied to publication in top journals among public heath researchers

Gur Huberman points us to this news article by Nicholas Bakalar, “Vigorous Exercise Tied to Macular Degeneration in Men,” which begins:

A new study suggests that vigorous physical activity may increase the risk for vision loss, a finding that has surprised and puzzled researchers.

Using questionnaires, Korean researchers evaluated physical activity among 211,960 men and women ages 45 to 79 in 2002 and 2003. Then they tracked diagnoses of age-related macular degeneration, from 2009 to 2013. . . .

They found that exercising vigorously five or more days a week was associated with a 54 percent increased risk of macular degeneration in men. They did not find the association in women.

The study, in JAMA Ophthalmology, controlled for more than 40 variables, including age, medical history, body mass index, prescription drug use and others. . . . an accompanying editorial suggests that the evidence from such a large cohort cannot be ignored.

The editorial, by Myra McGuinness, Julie Simpson, and Robert Finger, is unfortunately written entirely from the perspective of statistical significance and hypothesis testing, but they raise some interesting points nonetheless (for example, that the subgroup analysis can be biased if the matching of treatment to control group is done for the entire sample but not for each subgroup).

The news article is not so great, in my opinion. Setting aside various potential problems with the study (including those issues raised by McGuinness et al. in their editorial), the news article makes the mistake of going through all the reported estimates and picking the largest one. That’s selection bias right there. “A 54 percent increased risk,” indeed. If you want to report the study straight up, no criticism, fine. But then you should report the estimated main effect, which was 23% (as reported in the journal article, “(HR, 1.23; 95% CI, 1.02-1.49)”). That 54% number is just ridiculous. I mean, sure, maybe the effect really is 54%, who knows? But such an estimate is not supported by the data: it’s the largest of a set of reported numbers, any of which could’ve been considered newsworthy. If you take a set of numbers and report only the maximum, you’re introducing a bias.

Part of the problem, I suppose, is incentives. If you’re a health/science reporter, you have a few goals. One is to report exciting breakthroughs. Another is to get attention and clicks. Both goals are served, at least in the short term, by exaggeration. Even if it’s not on purpose.

OK, on to the journal article. As noted above, it’s based on a study of 200,000 people: “individuals between ages 45 and 79 years who were included in the South Korean National Health Insurance Service database from 2002 through 2013,” of whom half engaged in vigorous physical activity and half did not. It appears that the entire database contained about 500,000 people, of which 200,000 were selected for analysis in this comparison. The outcome is neovascular age-related macular degeneration, which seems to be measured by a prescription for ranibizumab, which I guess was the drug of choice for this condition in Korea at that time? Based on the description in the paper, I’m assuming they didn’t have direct data on the medical conditions, only on what drugs were prescribed, and when, hence “ranibizumab use from August 1, 2009, indicated a diagnosis of recently developed active (wet) neovascular AMD by an ophthalmologist.” I don’t know if there were people with neovascular AMD who which was not captured in this dataset because they never received this diagnosis.

In their matched sample of 200,000 people, 448 were recorded as having neovascular AMD: 250 in the vigorous exercise group and 198 in the control group. The data were put into a regression analysis, yielding an estimated hazard ratio of 1.23 with 95% confidence interval of [1.02, 1.49]. Also lots of subgroup analyses: unsurprisingly, the point estimate is higher for some subgroups than others; also unsurprisingly, some of the subgroup analyses reach statistically significance and some are not.

It is misleading to report that vigorous physical activity was associated with a greater hazard rate for neovascular AMD in men but not in women. Both the journal article and the news article made this mistake. The difference between “significant” and “non-significant” is not itself statistically significant.

So what do I think about all this? First, the estimates are biased due to selection on statistical significance (see, for example, section 2.1 here). Second, given how surprised everyone is, this suggests a prior distribution on any effect that should be concentrated near zero, which would pull all estimates toward 0 (or pull all hazard ratios toward 1), and I expect that the 95% intervals would then all include the null effect. Third, beyond all the selection mentioned above, there’s the selection entailed in studying this particular risk factor and this particular outcome. In this big study, you could study the effect of just about any risk factor X on just about any outcome Y. I’d like to see a big grid of all these things, all fit with a multilevel model. Until then, we’ll need good priors on the effect size for each study, or else some corrections for type M and type S errors.

Just reporting the raw estimate from one particular study like that: No way. That’s a recipe for future non-replicable results. Sorry, NYT, and sorry, JAMA: you’re gettin played.

P.S. Gur wrote:

The topic may merit two posts — one for the male subpopulation, another for the female.

To which I replied:

20 posts, of which 1 will be statistically significant.

P.P.S. On the plus side, Jonathan Falk pointed me the other day to this post by Scott Alexander, who writes the following about a test of a new psychiatric drug:

The pattern of positive results shows pretty much the random pattern you would expect from spurious findings. They’re divided evenly among a bunch of scales, with occasional positive results on one scale followed by negative results on a very similar scale measuring the same thing. Most of them are only the tiniest iota below p = 0.05. Many of them only work at 40 mg, and disappear in the 80 mg condition; there are occasional complicated reasons why drugs can work better at lower doses, but Occam’s razor says that’s not what’s happening here. One of the results only appeared in Stage 2 of the trial, and disappeared in Stage 1 and the pooled analysis. This doesn’t look exactly like they just multiplied six instruments by two doses by three ways of grouping the stages, got 36 different cells, and rolled a die in each. But it’s not too much better than that. Who knows, maybe the drug does something? But it sure doesn’t seem to be a particularly effective antidepressant, even by our very low standards for such. Right now I am very unimpressed.

It’s good to see this mode of thinking becoming so widespread. It makes me feel that things are changing in a good way.

So, some good news for once!