Fitting multilevel models when the number of groups is small

Matthew Poes writes:

I have a question that I think you have answered for me before. There is an argument to be made that HLM should not be performed if a sample is too small (too few level 2 units and too few level 1 units). Lots of papers have been written with guidelines on what those numbers should be. It’s my understanding that those guidelines may not be worth much, and I believe even you have suggested that when faced with small samples, it is probably better to just simulate.

Is it accurate to say that if a data set is clearly nested, there is dependence, and the sample is too small to do HLM, then no analysis is ok? That a different analysis that doesn’t address dependence but is not necessarily as biased with small samples (or so they say) is still not ok? I think you mentioned this before.

Let’s say you want to prove that Head Start centers that measure as having higher “capacity” (as measured on a multi-trait multi-method assessment of capacity) have teachers who are more “satisfied” with their jobs. Is it right that simply looking at the correlation between site capacity and site average job satisfaction is not ok if you only have 15 sites (and 50 total teachers unequally distributed among these sites)? This is a real question I’ve been given, with the names and faces changed. My instinct is that they aren’t analyzing the question they asked and that this isn’t right.

Would the use of a Bayesian GLM be an option, or am I expecting too much magic here? This isn’t my study, but I hate to go back to someone and say, Hey sorry, you spent 2 years and there is nothing you can do quantitatively here (though I would much rather say that than allow this correlation to be published).

My quick response is that the model is fine if you’re not data-rich; it’s just that in such a setting the prior distribution is more important. Flat priors will not make sense because they allow the possibility of huge coefficients that are not realistic. My book with Hill is not at all clear on this point, as we pretty much only use flat priors, and we don’t really wrestle with the problems that this can cause. Moving forward, though, I think the right thing to do is to fit multilevel models with informative priors. Setting up these priors isn’t trivial but it’s not impossible either; see for example the bullet points on page 13 of this article for an example in a completely different context. As always, it would be great to have realistic case studies of this sort of thing (in this case, informative priors for multilevel models in analyses of social science interventions) that people can use as templates for their own analyses. We should really include one such example in Advanced Regression and Multilevel Models, the in-preparation second edition of the second half of our book.

Short-term, for your problem now, I recommend the multilevel models with informative priors. I’m guessing it will be a lot easier to do this than you might think.
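To get a feel for how little a site-level correlation from 15 sites can pin down, here is a minimal fake-data simulation sketch in the spirit of the "just simulate" advice mentioned in the question. Everything numeric in it is an assumption made up for illustration: the split of 50 teachers across 15 sites, the true slope of 0.2, and the site-level and teacher-level standard deviations. It is not the analysis from the study in question, and it is not a multilevel fit; it only shows how widely the naive correlation varies across replications of a design like this one.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_site_correlation(rng, site_sizes, beta, sigma_site, sigma_teacher):
    """Simulate one fake data set and return the correlation between
    site capacity and site-average job satisfaction."""
    capacity = [rng.gauss(0, 1) for _ in site_sizes]
    site_means = []
    for x, n in zip(capacity, site_sizes):
        # Site-level mean satisfaction: slope on capacity plus site-level noise.
        site_effect = beta * x + rng.gauss(0, sigma_site)
        # Teacher-level satisfaction scores within the site.
        teachers = [site_effect + rng.gauss(0, sigma_teacher) for _ in range(n)]
        site_means.append(statistics.fmean(teachers))
    return pearson_r(capacity, site_means)


# Hypothetical design: 15 sites, 50 teachers, unequally distributed.
SITE_SIZES = [8, 6, 5, 5, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]

rng = random.Random(2015)  # fixed seed so the run is reproducible
rs = sorted(
    simulate_site_correlation(
        rng, SITE_SIZES, beta=0.2, sigma_site=0.5, sigma_teacher=1.0
    )
    for _ in range(2000)
)
lo, hi = rs[int(0.025 * len(rs))], rs[int(0.975 * len(rs))]
print(f"Middle 95% of simulated correlations: {lo:.2f} to {hi:.2f}")
```

Changing the assumed slope and standard deviations and rerunning is the point of the exercise: if the range of simulated correlations stays wide under every plausible setting, the observed correlation alone cannot carry much weight.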

Poes then replied:

That example came from a real scenario where a prior study actually had unusually high coefficients. It was an intervention designed for professional development of practitioners. In general, most studies of a similar nature have had no or little effect. An effect size of .2 to .5 is pretty common. This particular intervention was not so unusual as to expect much higher effects, but they ended up with effects closer to .8 or so, and the sample was very small (it was a pilot study). They used that evidence as a means to justify a second small study. I suspect there is a great deal more uncertainty in those findings than it appears to the evaluation team, and I suspect if priors from those earlier studies were to be included, the coefficients would be more reasonable. The second study has not yet been completed, but I will be shocked if they see the same large effects.

This is an exaggeration, but to put this large effect into perspective, it would be as if we were suggesting that spending an extra ten minutes a day on hands-on supervision of preschool teachers would lead to their students knowing ten more letters by the end of the year. I think you have addressed this before, but I do think people sometimes forget to take a step back from their statistics to consider what those statistics mean in practical terms.
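The suspicion that a prior built from earlier studies would pull the pilot estimate toward something more reasonable can be sketched with the standard normal-normal update, in which the posterior mean is a precision-weighted average of the prior mean and the estimate. The numbers below are assumptions for illustration only: a prior centered at 0.2 with standard deviation 0.2 as a rough stand-in for "similar studies", and a pilot estimate of 0.8 with a standard error of 0.4. The pilot's actual standard error is not reported above, so 0.4 is a guess.

```python
import math


def normal_posterior(prior_mean, prior_sd, estimate, std_error):
    """Combine a normal prior with a normal likelihood for one effect.

    Returns the posterior mean and standard deviation, using
    precision (1 / variance) weighting.
    """
    prior_prec = 1.0 / prior_sd**2
    data_prec = 1.0 / std_error**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return post_mean, math.sqrt(post_var)


# Hypothetical inputs, not taken from the studies discussed.
post_mean, post_sd = normal_posterior(
    prior_mean=0.2, prior_sd=0.2, estimate=0.8, std_error=0.4
)
print(f"posterior mean = {post_mean:.2f}, posterior sd = {post_sd:.2f}")
# prints: posterior mean = 0.32, posterior sd = 0.18
```

Under these assumed inputs the noisy 0.8 is pulled most of the way back toward the range seen in earlier work. A tighter prior or a larger standard error would pull it further, and a more precise pilot would let the data dominate.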

Poes also added:

While we are talking about these studies as if Bayesian analysis would be used, they are in fact all analyzed using frequentist methods. I’m not sure if that was clear.

And then he had one more question:

When selecting past studies to use as informative priors, does the quality of the research matter? I have to imagine the answer is yes. A common argument I hear against looking to past results as evidence for current or future results is that the past research is of insufficient quality: sample too small, measures too noisy, theory of change ill-thought-out, etc. My guess is that it does matter and those issues all potentially matter, but . . . It seems like that then raises the question, at what point is the quality sufficiently bad to merit exclusion? Based on what criteria? Study rating systems (e.g., CONSORT) exist, but I’m assuming that they are not a common part of the process, and I would also guess that many of the criteria are unimportant for a study’s use as a prior. I’ve worked on a few study rating tools (including one that is in the process of being published as we speak) and my experience has been that a lot of concessions are made to ensure at least some studies make it through. To go back to my earlier question, I had pointed out that sample size adequacy shouldn’t be based on a fixed number (e.g., at least 100 participants) and maybe not based on the existence of a power analysis, but rather on something more nuanced.

This brings me back to my general recommendation that researchers have a “paper trail” to justify their models, including their choice of prior distributions. I have no easy answers here, but, as usual, the default flat prior can cause all sorts of havoc, so I think it’s worth thinking hard about how large you can expect effect sizes to be, and what substantive models correspond to various assumed distributions of effect sizes.
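One small piece of such a paper trail is to write down what a candidate prior says about large effects before looking at the data. The sketch below computes the prior probability that a standardized effect exceeds 0.8 in absolute value under zero-centered normal priors of different widths. The three prior standard deviations are arbitrary choices for illustration, with 10.0 standing in for a nearly flat prior; they are not recommendations.

```python
from statistics import NormalDist


def prob_abs_effect_exceeds(threshold, prior_sd, prior_mean=0.0):
    """Prior probability that |effect| > threshold under a normal prior."""
    prior = NormalDist(mu=prior_mean, sigma=prior_sd)
    return prior.cdf(-threshold) + (1.0 - prior.cdf(threshold))


# Hypothetical prior widths; 10.0 mimics a nearly flat prior.
tail_probs = {sd: prob_abs_effect_exceeds(0.8, sd) for sd in (0.3, 1.0, 10.0)}
for sd, p in tail_probs.items():
    print(f"prior sd = {sd:>4}: P(|effect| > 0.8) = {p:.3f}")
```

If the substantive literature says effects near 0.8 are rare, a prior that puts most of its mass beyond 0.8 contradicts what is already known, and that mismatch is worth recording alongside the model.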

P.S. Yes, this question comes up a lot! For example, a quick Google search reveals:

Multilevel models with only one or two groups (from 2006)

No, you don’t need 20 groups to do a multilevel analysis (from 2007)

Hierarchical modeling when you have only 2 groups: I still think it’s a good idea, you just need an informative prior on the group-level variation (from 2015)