# He’s a history teacher and he has a statistics question

Someone named Ian writes:

I am a History teacher who has become interested in statistics! The main reason for this is that I’m reading research papers about teaching practices to find out what actually “works.”

I’ve taught myself the basics of null hypothesis significance testing, though I confess I am no expert (Maths was never my strong point at school!). But I also came across your blog after I heard about this “replication crisis” thing.

I wanted to ask you a question, if I may.

Suppose a randomised controlled experiment is conducted with two groups and the mean difference turns out to be statistically significant at the .05 level. I’ve learnt from my self-study that this means:

“If there were genuinely no difference in the population, the probability of getting a result this big or bigger is less than 5%.”

So far, so good (or so I thought).

But from my recent reading, I’ve gathered that many people criticise studies for using “small samples.” What was interesting to me is that they criticise this even after a significant result has been found.

So they’re not saying “Your sample size was small so that may be why you didn’t find a significant result.” They’re saying: “Even though you did find a significant result, your sample size was small so your result can’t be trusted.”

I was just wondering whether you could explain why one should distrust significant results with small samples? Some people seem to be saying it’s because it may have been a chance finding. But isn’t that what the p-value is supposed to tell you? If p is less than 0.05, doesn’t that mean I can assume it (probably) wasn’t a “chance finding”?

My reply: See my paper, “The failure of null hypothesis significance testing when studying incremental changes, and what to do about it,” recently published in the Personality and Social Psychology Bulletin. The short answer is that (a) it’s not hard to get p less than 0.05 just from chance, via forking paths, and (b) when effect sizes are small and a study is noisy, any estimate that reaches “statistical significance” is likely to be an overestimate, perhaps a huge overestimate.
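Point (b) can be seen directly in simulation. The sketch below is not from the paper cited above; it is a minimal illustration under assumed numbers (a small true group difference of 0.1, outcome standard deviation 1, and 20 subjects per group). It repeatedly runs a two-group experiment, keeps only the “statistically significant” results, and compares the average significant estimate to the true effect:

```python
import random
import statistics

random.seed(1)

# Illustrative assumptions, not values from any real study:
true_effect = 0.1   # small true difference between the two groups
sd = 1.0            # noisy outcome measure
n = 20              # small sample size per group
sims = 20000        # number of simulated experiments

def simulate_once():
    """Run one two-group experiment; return (estimate, significant?)."""
    a = [random.gauss(0.0, sd) for _ in range(n)]
    b = [random.gauss(true_effect, sd) for _ in range(n)]
    diff = statistics.mean(b) - statistics.mean(a)
    se = (statistics.variance(a) / n + statistics.variance(b) / n) ** 0.5
    # Approximate two-sided test at the .05 level
    return diff, abs(diff / se) > 1.96

results = [simulate_once() for _ in range(sims)]
significant = [d for d, sig in results if sig]

power = len(significant) / sims
exaggeration = statistics.mean(abs(d) for d in significant) / true_effect

print(f"share of experiments reaching p < .05: {power:.2f}")
print(f"avg |significant estimate| / true effect: {exaggeration:.1f}")
```

With these assumed numbers, only a small fraction of experiments reach significance, and those that do overstate the true effect several times over: to clear the significance threshold at all, the estimate must be far larger than the true difference, so conditioning on significance selects for overestimates.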