Breaking the dataset into little pieces and putting it back together again

June 19, 2017

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Alex Konkel writes:

I was a little surprised that your blog post with the three smaller studies versus one larger study question received so many comments, and also that so many people seemed to come down on the side of three smaller studies. I understand that Stephen’s framing led to some confusion as well as practical concerns, but I thought the intent of the question was pretty straightforward.

At the risk of beating a dead horse, I wanted to try asking the question a different way: if you conducted a study (or your readers, if you want to put this on the blog), would you ever divide up the data into smaller chunks to see if a particular result appeared in each subset? Ignoring cases where you might want to examine qualitatively different groups, of course; would you ever try to make fundamentally homogeneous/equivalent subsets? Would you ever advise that someone else do so?

For those caught up in the details, assume an extremely simple design. A simple comparison of two groups ending in a (Bayesian) t-test with no covariates, nothing fancy. In a very short time period you collected 450 people in each group using exactly the same procedure for each one; there is zero reason to believe that the data were affected by anything other than your group assignment. Would you forego analyzing the entire sample and instead break them into three random chunks?

My personal experience is that empirically speaking, no one does this. Except for cases where people are interested in avoiding model overfitting and so use some kind of cross validation or training set vs testing set paradigm, I have never seen someone break their data into small groups to increase the amount of information or strengthen their conclusions. The blog comments, however, seem to come down on the side of this being a good practice. Are you (or your readers) going to start doing this?

My reply:

From a Bayesian standpoint, the result is the same, whether you consider all the data at once, or stir in the data one-third at a time. The problem would come if you make intermediate decisions that involve throwing away information, for example if you take parts of the data and just describe them as statistically significant or not.

The post Breaking the dataset into little pieces and putting it back together again appeared first on Statistical Modeling, Causal Inference, and Social Science.

Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science

Tags: ,