How to reduce Type M errors in exploratory research?

May 15, 2018

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Miao Yu writes:

Recently, I found this piece [a news article by Janet Pelley, Sulfur dioxide pollution tied to degraded sperm quality, published in Chemical & Engineering News] and the original paper [Inverse Association between Ambient Sulfur Dioxide Exposure and Semen Quality in Wuhan, China, by Yuewei Liu, published in Environmental Science & Technology].

Air pollution research is hot, especially in China. However, I think such studies may rely on a questionable approach. Typically, researchers collect many samples and measure many pollutants in them, then report whatever connection stands out between one contaminant and an environmental factor or disease, after checking the correlations among all compound–environmental factor/disease pairs, as in this study. I have to say this template has been used a lot in environmental studies. It is just a game of permutation and combination between thousands of exposure factors (which we can now detect in a single run) and thousands of public health concerns.

Since such observational studies are hard to truly randomize, I am uncomfortable with those results. It seems one could entertain thousands of hypotheses linking compounds to environmental factors or diseases, publish the one with “significant” differences, and announce it to the press. Of course we can control for age, gender, smoking, BMI, etc. However, it’s hard to control for the unknown unknowns rather than just the known parts. Furthermore, type M errors also lurk behind those studies.

Are there suggestions for avoiding those kinds of errors or studies?

My reply: To start with, I’m not going to address this particular study, which happens to cost $40 to download. The effects of air pollution are an important topic but I think that for most of you there will be more interest in the general issue of how to learn from open-ended, exploratory studies.

So. The easiest answer to the “what to do in general” question is to simply separate the exploration and the inference: use the exploratory data, in concert with theory, to come up with some hypotheses, and then test them in a new preregistered study.

But I don’t like that answer because we want some answers now. I’m not saying we want certainty now, or near-certainty, or statistical significance—but we’d like to give our best estimates from the data we have; we don’t want to be using estimates that are clearly biased.
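To see why selecting on significance produces biased estimates, here is a small simulation of a type M (magnitude) error. The numbers are illustrative assumptions, not taken from any particular study: a small true effect measured with a noisy study, where we record only the estimates that cross the conventional significance threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1   # assumed small true effect (illustrative)
se = 0.2            # assumed standard error of each study's estimate
n_sims = 100_000

# Each simulated study reports a noisy estimate of the true effect.
estimates = rng.normal(true_effect, se, size=n_sims)

# Keep only the "statistically significant" estimates (|z| > 1.96),
# mimicking publication selection.
significant = estimates[np.abs(estimates / se) > 1.96]

# Conditional on significance, estimates greatly overstate the effect.
exaggeration = np.mean(np.abs(significant)) / true_effect
print(f"share significant:  {len(significant) / n_sims:.2f}")
print(f"exaggeration ratio: {exaggeration:.1f}x")
```

With these assumed numbers, only a small fraction of studies reach significance, and those that do overestimate the true effect several-fold; that multiplier is the type M error.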

So what should be done? Here are some suggestions:

1. Forget about statistical significance. Publish all your results and don’t select the results that exceeded some threshold for special treatment. If you’re looking at associations between many different predictors and many different outcomes, show the correlations or coefficients in a big table.
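A minimal sketch of suggestion 1, using simulated data: compute and report the full grid of pollutant–outcome correlations, with no filtering on any threshold. The variable names are illustrative, not from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Illustrative simulated measurements (names are assumptions).
pollutants = {name: rng.normal(size=n) for name in ["SO2", "NO2", "PM2.5"]}
outcomes = {name: rng.normal(size=n) for name in ["sperm_count", "motility"]}

# The full correlation table: every pollutant-outcome pair is reported,
# none singled out for passing a significance threshold.
table = {
    (p, o): float(np.corrcoef(x, y)[0, 1])
    for p, x in pollutants.items()
    for o, y in outcomes.items()
}
for (p, o), r in table.items():
    print(f"{p:6s} vs {o:12s}: r = {r:+.3f}")
```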

2. Partially pool the estimates toward zero. This can be done using informative priors or with multilevel modeling. You can’t get selection bias down to 0 (the type M error depends on the unknown parameter value) but you can at least reduce it.
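Suggestion 2 can be sketched in its simplest form: shrink each raw estimate toward zero using a normal prior centered at zero. The prior scale `tau` and the example numbers below are assumptions for illustration; a full multilevel model would also estimate `tau` from the data.

```python
import numpy as np

def shrink(estimate, se, tau):
    """Posterior mean under y ~ N(theta, se^2) with prior theta ~ N(0, tau^2)."""
    w = tau**2 / (tau**2 + se**2)  # weight given to the data
    return w * estimate

raw = np.array([0.50, -0.42, 0.05])  # illustrative raw coefficients
se = np.array([0.20, 0.20, 0.05])    # their standard errors
tau = 0.1                            # prior sd: effects expected to be small

pooled = shrink(raw, se, tau)
print(pooled)  # the noisiest estimates are shrunk the most
```

Note the behavior: the large-but-noisy estimates are pulled strongly toward zero, while the precisely estimated coefficient moves little. That is exactly the correction for selection-driven exaggeration.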

3. Control for age, gender, smoking, BMI, etc (which I assume was done in the above-linked study). Adjusting for these predictors will not fix all your problems but, again, it seems like it’s going in the right direction.
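Suggestion 3 is ordinary covariate adjustment; here is a sketch with simulated data in which exposure is correlated with age, so the unadjusted association is confounded. All variable names and coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Simulated covariates and an exposure that is correlated with age.
age = rng.uniform(20, 50, n)
bmi = rng.normal(25, 3, n)
smoker = rng.binomial(1, 0.3, n)
exposure = 0.02 * age + rng.normal(size=n)

# Outcome depends on exposure (true coefficient -0.3) and on age.
outcome = -0.3 * exposure - 0.05 * age + rng.normal(size=n)

# Design matrix: intercept, exposure, then the adjustment covariates.
X = np.column_stack([np.ones(n), exposure, age, bmi, smoker])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
print(f"adjusted exposure coefficient: {coef[1]:+.2f}")
```

Adjusting for age recovers an exposure coefficient near the assumed truth of -0.3; leaving age out would fold some of its effect into the exposure estimate. As noted above, this fixes only the confounders you have measured.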

The point is that whether we think of our goal as getting the best estimates to make decisions right now, or if we’re just considering this as an exploratory analysis—either way we want to learn as much from the data as possible, and correct for biases as much as we can.
