For the first time, I’m excerpting all of Excursion 1 Tour II from SIST (2018, CUP).

**1.4 The Law of Likelihood and Error Statistics**

If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail – to use probability to arrive at a type of logic of evidential support – and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965) – the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one.

*Law of Likelihood (LL):* Data **x** are better evidence for hypothesis *H*_{1} than for *H*_{0} if **x** is more probable under *H*_{1} than under *H*_{0}: Pr(**x**; *H*_{1}) > Pr(**x**; *H*_{0}), that is, *the likelihood ratio LR* of *H*_{1} over *H*_{0} exceeds 1.

*H*_{0} and *H*_{1} are statistical hypotheses that assign probabilities to the values of the random variable **X**. A fixed value of **X** is written *x*_{0}, but we often want to generalize about this value, in which case, following others, I use **x**. The *likelihood of the hypothesis* *H*, given data **x**, is the probability of observing **x**, under the assumption that *H* is true or adequate in some sense. Typically, the ratio of the likelihood of *H*_{1} over *H*_{0} also supplies the quantitative measure of comparative support. Note when **X** *is continuous, the probability is assigned over a small interval around* **X** *to avoid probability 0.*

**Does the Law of Likelihood Obey the Minimal Requirement for Severity?**

Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. Two ways to see this. First, suppose you discover all of the stocks in Pickrite’s promotional letter went up in value (**x**) – all winners. A hypothesis *H* to explain this is that their method always succeeds in picking winners. *H* *entails* **x**, so the likelihood of *H* given **x** is 1. Yet we wouldn’t say *H* is therefore highly probable, especially without reason to put to rest that they culled the winners post hoc. For a second way, at any time, the same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true.

Suppose Bristol-Roach, in our Bernoulli tea tasting example, got two correct guesses followed by one failure. The observed data can be represented as *x*_{0} = <1,1,0>. Let the hypotheses be different values for θ, the probability of success on each independent trial. The likelihood of the hypothesis *H*_{0}: θ = 0.5, given *x*_{0}, which we may write as Lik(0.5), equals (½)(½)(½) = 1/8. Strictly speaking, we should write Lik(θ; *x*_{0}), because it’s always computed given data *x*_{0}; I will do so later on. The likelihood of the hypothesis θ = 0.2 is Lik(0.2) = (0.2)(0.2)(0.8) = 0.032. In general, the likelihood in the case of Bernoulli independent and identically distributed trials takes the form: Lik(θ) = θ^{s}(1 − θ)^{f}, 0 < θ < 1, where *s* is the number of successes and *f* the number of failures. Infinitely many values for θ between 0 and 1 yield positive likelihoods; clearly, then, likelihoods do not sum to 1, or any number in particular. Likelihoods do not obey the probability calculus.
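The arithmetic in the tea tasting example can be checked with a short sketch. The function name `bernoulli_likelihood` and the helper `grid_total` are illustrative choices, not anything from the book; the grid sums are only meant to show that likelihoods over different values of θ add up to whatever the grid happens to make them, rather than to 1.

```python
def bernoulli_likelihood(theta, s, f):
    """Lik(theta) = theta^s * (1 - theta)^f for s successes and f failures."""
    return theta ** s * (1 - theta) ** f


# Observed data x0 = <1, 1, 0>: two successes, one failure.
s, f = 2, 1

lik_half = bernoulli_likelihood(0.5, s, f)   # (1/2)(1/2)(1/2) = 1/8
lik_fifth = bernoulli_likelihood(0.2, s, f)  # (0.2)(0.2)(0.8) = 0.032

# Likelihood ratio of theta = 0.5 over theta = 0.2.
lr = lik_half / lik_fifth


def grid_total(k):
    """Sum of likelihoods over the grid theta = 1/k, 2/k, ..., (k-1)/k."""
    return sum(bernoulli_likelihood(i / k, s, f) for i in range(1, k))


# The total depends on how finely the grid is drawn, so it is not
# pinned to 1 or to any other particular number.
coarse_total = grid_total(10)
fine_total = grid_total(100)

print(lik_half, lik_fifth, lr)
print(coarse_total, fine_total)
```

On these data the ratio of Lik(0.5) to Lik(0.2) comes out a little under 4, and refining the grid makes the summed likelihoods grow without any change in the data.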

The Law of Likelihood (LL) will immediately be seen to fail on our minimal severity requirement – at least if it is taken as an account of inference. Why? There is no onus on the Likelihoodist to predesignate the rival hypotheses – you are free to search, hunt, and post-designate a more likely, or even maximally likely, rival to a test hypothesis *H*_{0}.

Consider the hypothesis that θ = 1 on trials one and two and 0 on trial three. That makes the probability of **x** maximal. For another example, hypothesize that the observed pattern would always recur in three trials of the experiment (I. J. Good said in his cryptoanalysis work these were called “kinkera”). Hunting for an impressive fit, or trying and trying again, one is sure to find a rival hypothesis *H*_{1} much better “supported” than *H*_{0} even when *H*_{0} is true. As George Barnard puts it, “there *always* is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972, p. 129).
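To make the post-designated rival concrete, here is a minimal sketch for the observed sequence <1,1,0>. It assumes independent trials whose success probability may differ from trial to trial; the function name `likelihood` and the variable names are hypothetical.

```python
def likelihood(trial_probs, outcomes):
    """Probability of the observed 0/1 outcomes, given a success
    probability for each independent trial."""
    lik = 1.0
    for p, x in zip(trial_probs, outcomes):
        lik *= p if x == 1 else (1 - p)
    return lik


x0 = [1, 1, 0]

# H0: theta = 0.5 on every trial.
lik_h0 = likelihood([0.5] * len(x0), x0)

# Post-designated rival: theta = 1 wherever a success was observed,
# theta = 0 wherever a failure was observed.
rival = [float(x) for x in x0]
lik_rival = likelihood(rival, x0)

# Likelihood ratio of the rival over H0.
lr = lik_rival / lik_h0

print(lik_h0, lik_rival, lr)
```

The rival is read off the data, so its likelihood is 1 by construction, and its ratio over θ = 0.5 is 1/(1/8) = 8.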

Note that for any outcome of *n* Bernoulli trials, the likelihood of *H*_{0}: θ = 0.5 is (0.5)^{n}, so is quite small. The likelihood ratio (LR) of a best-supported alternative compared to *H*_{0} would be quite high. Since one could always erect such an alternative,

(*) Pr(LR in favor of *H*_{1} over *H*_{0}; *H*_{0}) = maximal.

*Thus the LL permits BENT evidence.* The severity for *H*_{1} is minimal, though the particular *H*_{1} is not formulated until the data are in hand. I call such maximally fitting, but minimally severely tested, hypotheses *Gellerized*, since Uri Geller was apt to erect a way to explain his results in ESP trials. Our Texas sharpshooter is analogous because he can always draw a circle around a cluster of bullet holes, or around each single hole. One needn’t go to such an extreme rival, but it suffices to show that the LL does not control the probability of erroneous interpretations.
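One way to see why the probability in (*) is maximal is to enumerate the sample space. The sketch below assumes the rival is Barnard's "things just had to turn out the way they actually did", which entails whatever sequence is observed and so has likelihood 1; the function name `prob_rival_favored` is hypothetical.

```python
from itertools import product


def prob_rival_favored(n, theta0=0.5):
    """Exact Pr(LR of the data-fitted rival over H0 exceeds 1; H0),
    summed over every outcome of n Bernoulli trials."""
    total = 0.0
    for outcome in product([0, 1], repeat=n):
        s = sum(outcome)
        # Probability of this exact sequence under H0: theta = theta0.
        lik_h0 = theta0 ** s * (1 - theta0) ** (n - s)
        # The rival is erected after seeing the outcome and entails it.
        lik_rival = 1.0
        if lik_rival / lik_h0 > 1:
            total += lik_h0
    return total


print(prob_rival_favored(3))
print(prob_rival_favored(5))
```

Every outcome has likelihood (0.5)^{n} under *H*_{0}, so the ratio is 2^{n} in every case and the rule favors the rival with probability 1, even though *H*_{0} generated the data.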

What do we do to compute (*)? We look beyond the specific observed data to the behavior of the general rule or method, here the LL. The output is always a comparison of likelihoods. We observe one outcome, but we can consider that for any outcome, unless it makes *H*_{0} maximally likely, we can find an *H*_{1} that is more likely. This lets us compute the relevant properties of the method: its inability to block erroneous interpretations of data. As always, a severity assessment is one level removed: you give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a *sampling distribution.* It’s not a very apt term, but nothing has arisen to replace it. For those who embrace the LL, once the data are given, it’s irrelevant what other outcomes could have been observed but were not. Likelihoodists say that such considerations make sense only if the concern is the performance of a rule over repetitions, but not for inference from the data. Likelihoodists hold to “the irrelevance of the sample space” (once the data are given). This is the key contrast between accounts based on error probabilities (error statistical) and logics of statistical inference.
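The sampling distribution of the rule can be tabulated for a less extreme rival than the Gellerized one. As an assumption of this sketch (not a construction from the text), the rival is restricted to a single constant θ chosen after the fact as the best-fitting value s/n. The names `lr_table` and `prob_lr_exceeds` are hypothetical.

```python
from math import comb


def lr_table(n, theta0=0.5):
    """For each success count s, return (s, LR, probability under H0),
    where LR compares the best-fitting constant theta (s/n) with theta0."""
    rows = []
    for s in range(n + 1):
        f = n - s
        theta_hat = s / n
        # Python evaluates 0.0 ** 0 as 1.0, which handles s = 0 and s = n.
        lik_hat = theta_hat ** s * (1 - theta_hat) ** f
        lik_h0 = theta0 ** s * (1 - theta0) ** f
        # Binomial probability of observing s successes when H0 is true.
        prob = comb(n, s) * lik_h0
        rows.append((s, lik_hat / lik_h0, prob))
    return rows


def prob_lr_exceeds(n, threshold=1.0, theta0=0.5):
    """Pr(LR > threshold; H0), read off the sampling distribution."""
    return sum(p for _, lr, p in lr_table(n, theta0) if lr > threshold)


for row in lr_table(4):
    print(row)
print(prob_lr_exceeds(4), prob_lr_exceeds(3))
```

With n = 4, only the outcome of two successes makes θ = 0.5 maximally likely, so the rule finds a better-supported rival with probability 10/16 = 0.625 under *H*_{0}; with n = 3 no outcome does, and the probability is 1. Computing this requires considering outcomes that were not observed, which is exactly what the Likelihoodist declares irrelevant.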

**To continue reading Excursion 1 Tour II, go here.**

__________

This excerpt comes from Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (Mayo, CUP 2018).

Earlier excerpts and mementos from SIST up to Dec 31, 2018 are here.

Jan 10, 2019 Excerpt from SIST is here, Jan 27 is here, and Feb 23 here.

Jan 13, 2019 Mementos from SIST (Excursion 4) are here. These are summaries of all 4 tours.

March 5, 2019 Blurbs of all 16 Tours can be found here.