Neyman, confronted with unfortunate news, would always say “too bad!” At the end of Jerzy Neyman’s birthday week, I cannot help imagining him saying “too bad!” as regards some twists and turns in the statistics wars. First, too bad Neyman-Pearson (N-P) tests aren’t in the ASA Statement (2016) on P-values: “To keep the statement reasonably simple, we did not address alternative hypotheses, error types, or power”. An especially aggrieved “too bad!” would be earned by the fact that those in love with confidence interval estimators don’t appreciate that Neyman developed them (in 1930) as a method with a precise interrelationship with N-P tests. So if you love CI estimators, then you love N-P tests!
Consider a typical N-P test of the mean of a Normal distribution, test T+: H0: µ ≤ µ0 vs H1: µ > µ0.
Imagine σ is known, since nothing of interest to the logic changes if it is estimated, as is more typical. Notice the null hypothesis is composite (it is not a point), and the alternative is explicit (you can’t jump from a small P-value to some theory that would “explain” it).[i]
The (1 – α) confidence interval (CI) corresponding to test T+ is that µ > the (1 – α) lower bound:
µ > M – cα(σ/√n).
M is the sample mean, and this is the generic lower confidence bound. Replacing M with the observed sample mean M0 yields the particular CI lower bound.
Why does µ > M – cα(σ/√n) correspond to the above test T+? Why is it an inversion of, or dual to, the test?
Consider, said Neyman, that the values of µ that exceed M0 – cα(σ/√n) are values of µ that could not be rejected at level α with sample mean M0. Equivalently, these are values of the parameter µ that M0 is not statistically significantly greater than at a P-value of α. Yes, CIs correspond to Neyman-Pearson tests and were developed by Neyman in 1930, a bit after Fisher’s Fiducial intervals. Yes, those doing CIs (the so-called “new” statistics) are doing Neyman-Pearson tests, only inverted. Neyman didn’t care if you called them hypothesis tests or significance tests (as we saw in my last post). [ii]
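To make the inversion concrete, here is a minimal Python sketch using only the standard library. The numbers (M0 = 0.4, σ = 1, n = 25, α = .025) and the function names are hypothetical, chosen purely for illustration. It computes the lower bound M0 – cα(σ/√n) and then runs test T+ against candidate values of µ just below, at, and just above that bound.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal

def lower_bound(m0, sigma, n, alpha):
    """One-sided (1 - alpha) lower confidence bound: M0 - c_alpha * sigma / sqrt(n)."""
    return m0 - Z.inv_cdf(1 - alpha) * sigma / sqrt(n)

def p_value(m0, mu_test, sigma, n):
    """P-value of T+ with test hypothesis mu <= mu_test: P(M >= m0; mu = mu_test)."""
    return 1 - Z.cdf((m0 - mu_test) / (sigma / sqrt(n)))

# Hypothetical data, for illustration only.
m0, sigma, n, alpha = 0.4, 1.0, 25, 0.025
lb = lower_bound(m0, sigma, n, alpha)

# Values of mu below the bound are rejected at level alpha; values above are not.
for mu_test in (lb - 0.1, lb, lb + 0.1):
    p = p_value(m0, mu_test, sigma, n)
    verdict = "rejected" if p <= alpha else "not rejected"
    print(f"mu' = {mu_test:+.3f}: P = {p:.4f} -> {verdict} at level {alpha}")
```

The P-value at the bound itself equals α (up to floating-point error), which is just the duality restated numerically.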
Thanks to the duality between tests and confidence intervals, you could give the information provided by a confidence interval at any level in terms of the corresponding test. For a two-sided, 95% confidence interval [µL, µU]:
µL is the (parameter) value that the sample mean is just statistically significantly greater than at the P= .025 level.
µU is the (parameter) value that the sample mean is just statistically significantly lower than at the P= .025 level.
That means it is wrong to say you cannot ascertain anything about the population effect size using P-value computations. You can. It’s not the only way. You can also use P-value functions (Fraser, Cox), power, and severity, but they are all interrelated.
You ask: Please tell me the value of µ that the sample mean M0 is just statistically significantly greater than, at the P = .025 level? The answer is the lower confidence bound µL.
If the tester is able to determine the P-value corresponding to a specific value of µ you wanted to test, then she is also able to use the observed M0 to compute the value µL.
Likewise for finding µU . All the information is there.
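As a rough illustration of "all the information is there", the sketch below recovers µL and µU from nothing but P-value computations, by searching for the value of µ at which the one-sided P-value is exactly .025. The data values and the bisection helper `solve_increasing` are assumptions made for the example, not anything prescribed by the tests themselves.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
m0, sigma, n = 0.4, 1.0, 25   # hypothetical observed mean, known sigma, sample size
se = sigma / sqrt(n)

def p_greater(mu):
    """P-value for 'M0 is significantly greater than mu': P(M >= m0; mu)."""
    return 1 - Z.cdf((m0 - mu) / se)

def p_lower(mu):
    """P-value for 'M0 is significantly lower than mu': P(M <= m0; mu)."""
    return Z.cdf((m0 - mu) / se)

def solve_increasing(f, target, lo, hi, tol=1e-10):
    """Bisection for f(x) = target, assuming f is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# mu_L: the value M0 is just significantly greater than at P = .025.
mu_L = solve_increasing(p_greater, 0.025, m0 - 10 * se, m0)
# mu_U: the value M0 is just significantly lower than at P = .025
# (1 - p_lower is increasing in mu, so solve 1 - p_lower = .975).
mu_U = solve_increasing(lambda mu: 1 - p_lower(mu), 0.975, m0, m0 + 10 * se)

c = Z.inv_cdf(0.975)
print(f"from P-values: [{mu_L:.4f}, {mu_U:.4f}]")
print(f"from formula : [{m0 - c * se:.4f}, {m0 + c * se:.4f}]")
```

The two printed intervals should agree to the displayed precision, since the search simply inverts the same Normal computation the CI formula uses.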
But choosing a single confidence level is quite inadequate. Yet that is still what members of today’s “new” CI tribe do, generally .95. They get very upset at your dichotomizing P ≤ 0.05 and P > 0.05, but happily dichotomize according to whether µ is in or out of the CI formed.
The severe tester always infers a discrepancy that is well indicated (if any) but also at least one that is poorly indicated. In relation to test T+, the inference µ > M0, where M0 is the observed mean, is a good benchmark for a terrible inference! It corresponds to a lower confidence bound at level 0.5! And yet, critics of significance tests very often advocate inferring the alternative
µ > M0
as either comparatively more likely or probable than the null or test hypothesis. For detailed examples, see SIST Excursion 4 Tour II Rejection Fallacies: Who’s Exaggerating What?
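The benchmark is easy to check numerically. In this small sketch (hypothetical M0, σ, and n again), the one-sided lower bound is computed at several confidence levels; at level 0.5 the critical value is zero, so the bound is M0 itself.

```python
from math import sqrt
from statistics import NormalDist

m0, sigma, n = 0.4, 1.0, 25   # hypothetical values, for illustration only
se = sigma / sqrt(n)

def lower_bound(level):
    """One-sided lower confidence bound for mu at the given confidence level."""
    c = NormalDist().inv_cdf(level)   # c is 0 when level = 0.5
    return m0 - c * se

for level in (0.5, 0.8, 0.9, 0.975, 0.99):
    print(f"confidence level {level:5.3f}: infer mu > {lower_bound(level):.3f}")
```

Reporting the bounds at a range of levels, rather than at .95 alone, also shows at a glance which discrepancies are well indicated and which are not.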
So why are members of the Confidence Interval tribe going around misrepresenting hypothesis tests as if they must take the form of Fisherian “simple” significance tests with a point null (nil) hypothesis, usually of 0? (N-P tests were purposely designed to improve upon Fisher’s tests, and it’s that improvement that gives you CIs.) And why do they say what’s inferred with a CI cannot be ascertained with N-P tests? Are they unaware they’re using N-P tests? Or is the simple Fisherian test (no explicit alternative, no consideration of power) just much easier to criticize? If they’re cousins or brothers, why the family feud? Sibling rivalry? Why be a Unitarian? Most testers would supply a P-value as well as a CI. The severe tester combines the two, so that discrepancies are directly reported from test results. For another reason, see [iii].
Critics of tests from outside the family will also take the simple “nil” point null vs a two-sided alternative as their foil, and demonstrate that the P-value ≠ either their Bayes Factor or posterior probability. It serves as a convenient straw test to knock down. If they kept the comparison to one-sided tests, they would not disagree (at least not with any sensible prior). This is shown by Casella and R. Berger (1987), and the reconciliation is agreed to by Berger and Sellke (1987). See SIST Excursion 4 Tour II Rejection Fallacies: Who’s Exaggerating What?
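One simple instance of the one-sided agreement can be sketched as follows. It assumes, purely for illustration, a flat (improper uniform) prior on µ, under which the posterior for µ is Normal with mean M0 and standard deviation σ/√n. This is only one special case, not the general result of the papers cited, and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers: observed mean, null boundary, known sigma, sample size.
m0, mu0, sigma, n = 0.4, 0.0, 1.0, 25
se = sigma / sqrt(n)

# One-sided P-value for T+: P(M >= m0; mu = mu0).
p_value = 1 - NormalDist().cdf((m0 - mu0) / se)

# ASSUMPTION: flat prior on mu, so the posterior is Normal(m0, se) and
# P(mu <= mu0 | data) is that posterior's CDF evaluated at mu0.
posterior_null = NormalDist(mu=m0, sigma=se).cdf(mu0)

print(f"one-sided P-value      : {p_value:.5f}")
print(f"posterior P(mu <= mu0) : {posterior_null:.5f}")
```

Under this assumed prior the two numbers coincide, whereas with a point null and a two-sided alternative they can differ substantially.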
I’m not saying the simple significance test doesn’t have uses; it’s vital for testing assumptions of statistical models. That’s why Bayesians who want to check their models can be found sneaking P-value goodies from the tests that many of them profess to dislike. If a small P-value indicates a discrepancy from the null there, it does so in other uses too. [iv]
Note too the connection between confidence intervals and severity: Taking a sample mean M that is just statistically significant at level α (Mα) as warranting µ > µ0 with severity 1 – α is the same as inferring µ > M0 – cα(σ/√n) at confidence level 1 – α. However, severity improves on CIs by breaking out of the single confidence level, providing an inferential justification (rather than merely a long-run coverage rationale), and avoiding a number of fallacies and paradoxes of ordinary CIs. For a post on CIs and severity see here. Also see: Do CIs Avoid Fallacies of Tests? Reforming the Reformers. For a full discussion, see SIST.
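The link between the two can be illustrated with a short sketch. It assumes that severity for the claim µ > µ1 in test T+ is computed as P(M ≤ M0; µ = µ1); that formula is an assumption of this example (the post does not spell it out), and the data values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
m0, sigma, n, alpha = 0.4, 1.0, 25, 0.025   # hypothetical values
se = sigma / sqrt(n)

def severity_greater(mu1):
    """ASSUMED severity for inferring mu > mu1 given M0: P(M <= m0; mu = mu1)."""
    return Z.cdf((m0 - mu1) / se)

# The (1 - alpha) lower confidence bound from the text.
lower = m0 - Z.inv_cdf(1 - alpha) * se

# Severity across a range of discrepancies, not just one confidence level.
for mu1 in (lower, 0.1, 0.2, 0.3, m0, 0.5):
    print(f"infer mu > {mu1:.3f}: severity = {severity_greater(mu1):.3f}")
```

At the lower bound the severity comes out as 1 – α, matching the confidence level, while at µ1 = M0 it drops to 0.5, the "terrible inference" benchmark mentioned above.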
[i] The null and alternative would be treated symmetrically. You are to choose the null, or more properly, what Neyman called the test hypothesis, according to which error was more serious. A lot of the agony that has people up in arms regarding the fallacy of taking non-significant results as evidence for a (point) null is immediately scotched by letting the test hypothesis be “an effect exists” (or an effect of a given magnitude is present). For example, T-: H0: µ ≥ µ0 vs H1: µ < µ0.
[ii] Note the equivalences:
µ < M – cα(σ/√n) iff M > µ + cα(σ/√n)
So µ < CI lower at confidence level 1 – α iff M reaches statistical significance at P = α in test T+.
Iff = if and only if.
[iii] Some prefer CIs to corresponding tests because it’s easier to slide the confidence level onto the interval estimate, viewing it as affording a probability assignment to the interval itself. This of course is, strictly, a fallacy, unless one just stipulates: I assign “probability” .95, say, to the result of applying a method if that method has .95 “coverage probability”. This is/was the Fiducial dream. But one cannot do probability computations with these assignments. For the severe tester’s evidential interpretation of CIs, please see SIST, Excursion 3 Tour III.
[iv] Moving from a discrepancy (from a model assumption) to a particular rival model invites the same risks as when explaining other small P-values by invoking a rival insofar as the null and the rival model do not exhaust the possibilities.
SIST = Mayo, D. (2018), Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, CUP.