Neyman: Distinguishing tests of statistical hypotheses and tests of significance might have been a lapse of someone’s pen

April 17, 2017

(This article was originally published at Statistics – Error Statistics Philosophy, and syndicated at StatsBlogs.)


April 16, 1894 – August 5, 1981

I’ll continue to post Neyman-related items this week in honor of his birthday. This isn’t the only paper in which Neyman makes it clear he denies a distinction between a test of statistical hypotheses and significance tests. He and E. Pearson also discredit the myth that the former is only allowed to report pre-data, fixed error probabilities, and is justified only by dint of long-run error control. Controlling the “frequency of misdirected activities” in the midst of finding something out, or solving a problem of inquiry, on the other hand, is an epistemological goal. What do you think?

Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena
by Jerzy Neyman

ABSTRACT. Contrary to ideas suggested by the title of the conference at which the present paper was presented, the author is not aware of a conceptual difference between a “test of a statistical hypothesis” and a “test of significance” and uses these terms interchangeably. A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests. The procedure is illustrated on two examples: (i) Le Cam’s (and associates’) study of immunotherapy of cancer and (ii) a socio-economic experiment relating to low-income homeownership problems.

I recommend, especially, the example on home ownership. Here are two snippets:


The title of the present session involves an element that appears mysterious to me. This element is the apparent distinction between tests of statistical hypotheses, on the one hand, and tests of significance, on the other. If this is not a lapse of someone’s pen, then I hope to learn the conceptual distinction. Particularly with reference to applied statistical work in a variety of domains of Science, my own thoughts of tests of significance, or EQUIVALENTLY of tests of statistical hypotheses, are that they are tools to reduce the frequency of errors.

(iv) A similar remark applies to the use of the words “decision” or “conclusion”. It seems to me that at our discussion, these particular words were used to designate only something like a final outcome of complicated analysis involving several tests of different hypotheses. In my own way of speaking, I do not hesitate to use the words “decision” or “conclusion” every time they come handy. For example, in the analysis of the follow-up data for the [home ownership] experiment, Mark Eudey and I started by considering the importance of bias in forming the experimental and control groups of families. As a result of the tests we applied, we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population. Acting on this assumption (or having reached this conclusion), we sought for ways to analyze that data other than by comparing the experimental and the control groups. The analyses we performed led us to “conclude” or “decide” that the hypotheses tested could be rejected without excessive risk of error. In other words, after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), we decided to act on the assumption that “high” scores on “potential” and on “education” are indicative of better chances of success in the drive to home ownership. (750-1; the emphasis is Neyman’s)

To read the full (short) paper: Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.

Following Neyman, I’ve “decided” to use the terms ‘tests of hypotheses’ and ‘tests of significance’ interchangeably in my book.[1] Now it’s true that Neyman was more behavioristic than Pearson, and it’s also true that tests of statistical hypotheses or tests of significance need an explicit reformulation and statistical philosophy to explicate the role of error probabilities in inference. My way of providing this has been in terms of severe tests. However, in Neyman-Pearson applications, more than in their theory, you can find many examples as well. Recall Neyman’s paper, “The Problem of Inductive Inference” (Neyman 1955) wherein Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap.  …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true].  The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman, pp 40-41).

Neyman continues:

The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95. (ibid.)

The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.
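To make Neyman’s point about power concrete, here is a minimal sketch in Python. It is not Neyman’s own example: I am assuming, purely for illustration, a one-sided z-test of a Normal mean with known standard deviation, and the function name `power_one_sided_z` and the numbers (a discrepancy of 0.2, sigma of 1, level 0.05) are hypothetical choices. It shows how the chance of detecting a fixed discrepancy depends on the number of observations, which is why a failure to reject with few observations is poor grounds for “confirming” H0.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu <= mu0 at the alternative mu0 + delta.

    Assumes (for illustration only) Normal data with known sigma.
    The test rejects H0 when sqrt(n) * (sample_mean - mu0) / sigma > z_crit.
    """
    std_normal = NormalDist()                # standard Normal, mean 0, sd 1
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection cutoff at level alpha
    shift = delta * sqrt(n) / sigma          # mean of the test statistic under the alternative
    return 1 - std_normal.cdf(z_crit - shift)


# Hypothetical discrepancy of interest: delta = 0.2 with sigma = 1.
for n in (5, 25, 400):
    print(n, round(power_one_sided_z(delta=0.2, sigma=1.0, n=n), 3))
```

Under these assumptions the power against the chosen discrepancy is low for the small samples and exceeds 0.95 only for the largest one. In Neyman’s terms, only in the last case would a failure to reject begin to count as anything like a confirmation that the discrepancy is absent.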

I’m adding a paper that I hadn’t posted before, one that echoes these same sentiments on the use of power, post-data, to evaluate what is “confirmed”: “The Use of the Concept of Power in Agricultural Experimentation”.

Neyman, like Peirce, Popper and many others, holds that the only “logic” is deductive logic. “Confirmation” for Neyman is akin to Popperian “corroboration”: you could corroborate a hypothesis H only to the extent that it passed a severe test, one with a high probability of having found flaws in H, if they existed. Of course, Neyman puts this in terms of having high power to reject H, if H is false, and high probability of finding no evidence against H if true, but it’s the same idea. But the use of power post-data is to interpret the discrepancies warranted in the given test. (This third use of power is also in Neyman 1956, responding to Fisher, the Triad.) Unlike Popper, however, Neyman actually provides a methodology that can be shown to accomplish the task reliably.

Still, Fisher was correct to claim that Neyman is merely recording his preferred way of speaking. One could choose a different way. For example, Peirce defined induction as passing a severe test, and Popper said you could define it that way if you wanted to. But the main thing is that Neyman is attempting to distinguish the “inductive” or “evidence transcending” conclusions that statistics affords, on his approach,[2] from assigning to hypotheses degrees of belief, probability, support, plausibility or the like.

De Finetti gets it right when he says that the expression “inductive behavior…that was for Neyman simply a slogan underlining and explaining the difference between his own, the Bayesian and the Fisherian formulations” became, with Wald’s work, “something much more substantial” (de Finetti 1972, p.176). De Finetti called this “the involuntarily destructive aspect of Wald’s work” (ibid.).


[1] That really is a decision, though it’s based on evidence that doing so is in sync with what both Neyman and Pearson thought. There are plenty of times, by the way, where Fisher is more behavioristic and less evidential than is Neyman, and certainly less than E. Pearson. I think this “he said/she said” route to understanding statistical methods is a huge mistake. I keep saying, “It’s the methods, stupid!”

[2] And, Neyman rightly assumed at first, from Fisher’s approach. Fisher’s loud rants, later on, that Neyman turned his tests into crude acceptance sampling affairs akin to Russian five-year plans, and money-making goals of U.S. commercialism, all occurred after the break in 1935 which registered a conflict of egos, not statistical philosophies. Look up “anger management” on this blog.

Fisher is the arch anti-Bayesian, whereas Neyman experimented with using priors at the start. The problem wasn’t so much viewing parameters as random variables, but lacking knowledge of what their frequentist distributions could possibly be. Thus he sought methods whose validity held up regardless of priors. Here E. Pearson was closer to Fisher, but unlike the two others, he was a really nice guy. (I hope everyone knows I’m talking of Egon here, not his mean daddy.) See chapter 11 of EGEK (1996).

[3] Who drew the picture of Neyman above? Anyone know?


de Finetti, B. 1972. Probability, Induction and Statistics: The Art of Guessing. Wiley.

Neyman, J. 1957. “The Use of the Concept of Power in Agricultural Experimentation.” Journal of the Indian Society of Agricultural Statistics, 9(1): 9–17.

Neyman, J. 1976. “Tests of Statistical Hypotheses and Their Use in Studies of Natural Phenomena.” Commun. Statist. Theor. Meth. A5(8): 737–751.

