I will continue to post mementos and, at times, short excerpts following the pace of one “Tour” a week, in sync with some book clubs reading *Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars* (SIST or Statinfast 2018, CUP), e.g., Lakens. This puts us at Excursion 2 Tour I, but first, here’s a quick Souvenir (Souvenir C) from Excursion 1 Tour II:

**Souvenir C: A Severe Tester’s Translation Guide**

Just as in ordinary museum shops, our souvenir literature often probes treasures that you didn’t get to visit at all. Here’s an example of that, and you’ll need it going forward. There’s a confusion about what’s being done when the significance tester considers the set of all of the outcomes leading to a d(X) greater than or equal to 1.96, i.e., {x: d(x) ≥ 1.96}, or just d(x) ≥ 1.96. This is generally viewed as throwing away the particular x, and lumping all these outcomes together. What’s really happening, according to the severe tester, is quite different. What’s actually being signified is that we are interested in the method, not just the particular outcome. Those who embrace the LP make it very plain that data-dependent selections and stopping rules drop out. To get them to drop in, we signal an interest in what the test procedure *would have* yielded. This is a counterfactual and is altogether essential in expressing the properties of the method, in particular, the probability it would have yielded some nominally significant outcome *or other*. When you see Pr(d(X) ≥ d(x_{0}); H_{0}), or Pr(d(X) ≥ d(x_{0}); H_{1}), for any particular alternative of interest, insert:

“the test procedure would have yielded”

just before the d(X). In other words, this expression, with its inequality, is a signal of interest in, and an abbreviation for, the error probabilities associated with a test.
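To make the “would have yielded” reading concrete, here is a minimal simulation sketch (mine, not the book’s) under the illustrative assumption that d(X) is approximately standard normal under H_{0}. The error probability Pr(d(X) ≥ 1.96; H_{0}) is estimated by asking how often the procedure, applied to other possible data sets from the sample space, would cross the cutoff:

```python
import math
import random

random.seed(1)

# Illustrative assumption: under H0, d(X) is approximately standard
# normal (not the book's exact test statistic).
CUTOFF = 1.96

# "The test procedure would have yielded d(X) >= 1.96": estimate the
# probability over many hypothetical data sets, not just the one observed.
draws = [random.gauss(0.0, 1.0) for _ in range(200_000)]
frac = sum(d >= CUTOFF for d in draws) / len(draws)

# Exact one-sided tail area for comparison: P(Z >= 1.96) ~ 0.025.
exact = 0.5 * math.erfc(CUTOFF / math.sqrt(2))

print(f"simulated Pr(d(X) >= 1.96; H0) = {frac:.4f}")
print(f"exact     Pr(Z    >= 1.96)     = {exact:.4f}")
```

The point of the simulation loop is precisely the counterfactual: each draw is a data set the procedure *could* have produced, and the error probability is a property of the method over that whole collection.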

**Applying the Severity Translation.** In Exhibit (i), Royall described a significance test with a Bernoulli(θ) model, testing H_{0}: θ ≤ 0.2 vs. H_{1}: θ > 0.2. We blocked an inference from observed difference d(x_{0}) ≃ 3.3 to θ = 0.8 as follows. (Recall that x̄ = 0.53 and d(x_{0}) ≃ 3.3.)

We computed Pr(d(X) > 3.3; θ = 0.8) ≃ 1.

We translate it as Pr(The test would yield d(X) > 3.3; θ = 0.8) ≃ 1. We then reason as follows:

Statistical inference: If θ = 0.8, then the method would virtually always give a difference larger than what we observed. Therefore, the data indicate θ < 0.8. (This follows for rejecting H_{0} in general.) When we ask: “How often would your test have found such a significant effect even if H_{0} is approximately true?” we are asking about the properties of the experiment that *did* happen. The counterfactual “would have” refers to how the procedure would behave in general, not just with these data, but with other possible data sets in the sample space.
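The numbers in Exhibit (i) can be checked with a short script. The sketch below assumes n = 16 Bernoulli trials, with d(x) = (x̄ − 0.2)/√(0.2·0.8/n); the text does not state n, and 16 is a hypothetical choice made because it turns x̄ = 0.53 into d(x_{0}) = 3.3 exactly. With θ = 0.8, the exact binomial tail confirms Pr(d(X) > 3.3; θ = 0.8) ≃ 1:

```python
from math import comb, sqrt

# Hypothetical sample size: n = 16 makes x-bar = 0.53 give d = 3.3
# exactly (the source does not state n).
n = 16
theta0 = 0.2                      # boundary of H0: theta <= 0.2

def d(xbar: float) -> float:
    """Standardized difference d(x) = (xbar - theta0) / SE under H0."""
    return (xbar - theta0) / sqrt(theta0 * (1 - theta0) / n)

print(round(d(0.53), 2))          # observed d(x0) = 3.3

# Pr(d(X) > 3.3; theta = 0.8): d(X) > 3.3 iff x-bar > 0.53, i.e. the
# success count S exceeds 16 * 0.53 = 8.48, so S >= 9.  Sum the exact
# Binomial(16, 0.8) tail from 9 to 16.
theta_alt = 0.8
prob = sum(comb(n, k) * theta_alt**k * (1 - theta_alt)**(n - k)
           for k in range(9, n + 1))
print(f"Pr(d(X) > 3.3; theta=0.8) = {prob:.3f}")
```

Under this assumed n, the tail probability comes out above 0.99, i.e., ≃ 1: if θ were 0.8, the test would virtually always have yielded an even larger difference than the one observed, which is exactly the counterfactual the severity translation asks us to read off.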