(This article was originally published at Error Statistics Philosophy » Statistics, and syndicated at StatsBlogs.)

I don’t know how to explain to this *economist blogger* that he is erroneously using p-values when he claims that “the odds are” (1 – p)/p that a null hypothesis is false. Maybe others want to jump in here?

**On significance and model validation** (Lars Syll)

Let us suppose that we as educational reformers have a hypothesis that implementing a voucher system would raise mean test results by 100 points (null hypothesis). Instead, when sampling, it turns out it raises them by only 75 points, with a standard error (telling us how much the mean varies from one sample to another) of 20.

Does this imply that the data do not disconfirm the hypothesis? Given the usual normality assumptions on sampling distributions, with a t-value of 1.25 [(100 − 75)/20] the one-tailed p-value is approximately 0.11. Thus, approximately 11% of the time we would expect a score this low or lower if we were sampling from this voucher system population. That means that, using the ordinary 5% significance level, we would not reject the null hypothesis, although the test has shown that it is likely – the odds are 0.89/0.11, or 8-to-1 – that the hypothesis is false…

And as shown over and over again when it is applied, people have a tendency to read “not disconfirmed” as “probably confirmed.” But looking at our example, standard scientific methodology tells us that since there is only 11% probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more “reasonable” to conclude that we have a case of disconfirmation.
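The arithmetic in the quoted passage is easy to check (the error lies in the interpretation of the p-value as odds, not in the calculation itself). A minimal sketch, using only the standard normal approximation with the post's numbers (hypothesized mean 100, observed mean 75, standard error 20):

```python
from math import erf, sqrt

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu_0, x_bar, se = 100, 75, 20          # null mean, observed mean, standard error
t = (mu_0 - x_bar) / se                # (100 - 75)/20 = 1.25
p_one_tailed = norm_cdf(-t)            # P(Z <= -1.25), prob. of a score this low or lower
print(round(t, 2), round(p_one_tailed, 2))  # 1.25 0.11
```

The p-value is the probability of a result this extreme *under the null*; it is not the probability that the null is false, so converting it to "(1 − p)/p odds against the null" has no justification.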

Of course, as we’ve discussed many times, failure to reject a null or test hypothesis is not evidence for the null (search this blog). We would, however, note that for the hypotheses H0: µ ≥ 100 vs. H1: µ < 100, and a failure to reject the null, one is interested in setting severity bounds such as:

sev(µ > 75) = .50
sev(µ > 60) = .773
sev(µ > 50) = .894
sev(µ > 30) = .988

So there’s clearly very poor evidence that µ exceeds 75.* Note too that sev(µ < 100) = .89.**
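These bounds follow from the severity calculation under the normal model: with a failure to reject, the claim µ > c passes with severity P(X̄ ≤ x̄obs; µ = c) = Φ((x̄obs − c)/SE). A minimal sketch reproducing the values above with the post's numbers (observed mean 75, standard error 20):

```python
from math import erf, sqrt

def norm_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

X_BAR, SE = 75, 20  # observed mean and standard error from the example

def sev_mu_greater(c):
    # Severity for the claim "mu > c" given a non-rejection:
    # the probability of a sample mean as small as the one observed,
    # were mu actually equal to c (the worst case under mu <= c).
    return norm_cdf((X_BAR - c) / SE)

for c in (75, 60, 50, 30):
    print(f"sev(mu > {c}) = {sev_mu_greater(c):.3f}")

# Severity for "mu < 100": prob. of a mean at least as large as 75 were mu = 100
sev_mu_less_100 = 1 - norm_cdf((X_BAR - 100) / SE)
print(f"sev(mu < 100) = {sev_mu_less_100:.2f}")
```

Note that sev(µ > 75) = .5 directly exhibits the poor evidence that µ exceeds the observed 75: half the time, a mean this small or smaller would occur even if µ were exactly 75.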

I agree that the issue of model validation is always vital, for all statistical approaches. See the unit beginning here.

*As Fisher always emphasized, several tests are required before regarding an experimental effect as absent or present. One might reserve SEV for such a combined assessment.

**I am very grateful to Aris Spanos for number-crunching for this post, while I’m ‘on the road’.

Filed under: fallacy of non-significance, Severity, Statistics
