Why “statistical significance” doesn’t work: An example.

Reading some of the back-and-forth in this thread, it struck me that some of the discussion was about data, some was about models, some was about underlying reality, but none of the discussion was driven by statements that this or that pattern in data was “statistically significant.”

Here’s the problem with “statistical significance” as I typically see it used. I see it used in four ways, all of them problematic:

1. Researcher has certain goals in mind, uses forking paths to find a statistically significant result consistent with a pre-existing story.

2. Researcher finds a non-significant result and identifies it as zero.

3. Researcher has a pile of results and agnostically uses statistical significance to decide what is real and what is not.

4. Community of researchers uses p-values to distinguish between different theories.

All these are bad. Approaches 1 and 2 are obviously bad, in that statistical significance is being used to imply empirical support beyond what can really be learned from the data. Approaches 3 and 4 are bad in a different way, in that they are taking whatever process of scientific discussion and learning is happening, and sprinkling it with noise.
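
To make approach 1 concrete, here is a minimal simulation sketch in Python. It is a crude stand-in for forking paths, not a model of any particular study: every true effect is exactly zero, each hypothetical study has 10 outcome measures with 20 observations apiece, and the researcher reports whichever comparison happens to come out significant. The function names, the sample sizes, and the use of a z-test with a known standard deviation of 1 are all assumptions made for illustration.

```python
import math
import random
import statistics


def two_sided_p(sample):
    """Two-sided z-test p-value for a true mean of zero, assuming a known sd of 1."""
    z = statistics.fmean(sample) * math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_studies=2000, n_outcomes=10, n_obs=20, alpha=0.05, seed=1):
    """Return the share of null studies that look significant under two rules.

    Rule A: test only the first (prespecified) outcome.
    Rule B: test every outcome and report the smallest p-value.
    """
    rng = random.Random(seed)
    prespecified_hits = 0
    best_of_many_hits = 0
    for _ in range(n_studies):
        # Pure noise: every outcome has a true effect of exactly zero.
        pvals = [
            two_sided_p([rng.gauss(0, 1) for _ in range(n_obs)])
            for _ in range(n_outcomes)
        ]
        prespecified_hits += pvals[0] < alpha
        best_of_many_hits += min(pvals) < alpha
    return prespecified_hits / n_studies, best_of_many_hits / n_studies


if __name__ == "__main__":
    single, forked = simulate()
    print(f"prespecified comparison significant: {single:.2f}")
    print(f"some comparison significant:         {forked:.2f}")
```

With independent outcomes, the chance that at least one of 10 null comparisons crosses p < 0.05 is 1 - 0.95^10, or about 0.40, so the second rate should land near 40% while the first stays near 5%. Real forking paths are subtler than this, since the researcher need not run all the comparisons explicitly for the analysis to be contingent on the data, but the arithmetic shows how easily a statistically significant result can appear when there is nothing to find.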