Noisy, heterogeneous data scoured from diverse sources make his metanalyses stronger.

November 10, 2017

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Kyle MacDonald writes:

I wondered if you’d heard of Purvesh Khatri’s work in computational immunology, profiled in this Q&A with Esther Landhuis at Quanta yesterday.

Elevator pitch is that he believes noisy, heterogeneous data scoured from diverse sources make his metanalyses stronger. The thing that gave me the woollies was this line:

“We start with dirty data,” he says. “If a signal sticks around despite the heterogeneity of the samples, you can bet you’ve actually found something.”

On the one hand, that seems like an almost verbatim restatement of your “what doesn’t kill my statistical significance makes it stronger” fallacy. On the other hand, he seems to use his methods purely to look for things to test empirically, rather than to draw conclusions based on the analysis, which is good, and might mean that the fallacy doesn’t apply. I also like his desire to look for connections that isolated groups might miss:

“I realized that heart transplant surgeons, kidney transplant surgeons and lung transplant surgeons don’t really talk to each other!”

I’d be interested in hearing your thoughts: worth the noise if he’s finding connections that no one would have thought to test?

My response:

I haven’t read Khatri’s research articles and I know next to nothing about this field of research so I can’t really say. Based on the above-quoted news article, the work looks great.

Regarding your question: On one hand, yes, it seems mistaken to have more confidence in one’s findings because the data were noisier. On the other hand, it’s not clear that by “dirty data,” he means “noisy data.” It seems that he just means “diverse data” from different settings. And there I agree that it should be better to include and model the variation (multilevel modeling!) than to study some narrow scenario. It also looks like good news that he uses training and holdout sets. That’s something we can’t always do in social science but should be possible in genetics where data are so plentiful.
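To make the "include and model the variation" point concrete, here is a minimal sketch of the simplest two-level model for this situation: a random-effects pooling of per-dataset effect estimates using the DerSimonian-Laird estimate of between-dataset variance. This is not Khatri's method, which I haven't read; the function name `random_effects_pool` and the numbers in the usage lines are hypothetical and purely illustrative.

```python
import math


def random_effects_pool(effects, variances):
    """Pool per-dataset effect estimates under a random-effects model.

    effects:   one effect estimate per dataset (hypothetical inputs)
    variances: the sampling variance of each estimate
    Returns (pooled_effect, standard_error, tau2), where tau2 is the
    DerSimonian-Laird estimate of between-dataset variance.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance weights
    sw = sum(w)

    # Fixed-effect estimate, used only to measure heterogeneity.
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))

    # Method-of-moments estimate of between-dataset variance, floored at 0.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Re-weight: each dataset's variance now includes the shared tau2.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2


# Illustrative numbers only: three datasets that disagree with each other.
pooled, se, tau2 = random_effects_pool([0.1, 0.9, 0.5], [0.01, 0.01, 0.01])
print(pooled, se, tau2)
```

When the datasets disagree, the estimated between-dataset variance is positive and the standard error of the pooled effect grows relative to a naive fixed-effect analysis. That is the sense in which diverse data help: the variation is estimated and carried into the uncertainty, not treated as evidence in itself.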



