November 18, 2012

(This article was originally published at Normal Deviate, and syndicated at StatsBlogs.)


When I started this blog, I said I wouldn’t write about the Bayes versus Frequentist thing. I thought that was old news.

But many things have changed my mind. Nate Silver’s book, various comments on my blog, comments on other blogs, Sharon McGrayne’s book, etc., have made it clear to me that there is still a lot of confusion about what Bayesian inference is and what Frequentist inference is.

I believe that many of the arguments about Bayes versus Frequentist are really about: what is the definition of Bayesian inference?

1. Some Obvious (and Not So Obvious) Statements

Before I go into detail, I’ll begin by making a series of statements.

Frequentist Inference is Great For Doing Frequentist Inference.
Bayesian Inference is Great For Doing Bayesian Inference.

Frequentist inference and Bayesian Inference are defined by their goals, not their methods.

A Frequentist analysis need not have good Bayesian properties.
A Bayesian analysis need not have good frequentist properties.

Bayesian Inference {\neq} Using Bayes Theorem

Bayes Theorem {\neq} Bayes Rule

Bayes Nets {\neq} Bayesian Inference

Frequentist Inference is not superior to Bayesian Inference.
Bayesian Inference is not superior to Frequentist Inference.
Hammers are not superior to Screwdrivers.

Confidence Intervals Do Not Represent Degrees of Belief.
Posterior Intervals Do Not (In General) Have Frequency Coverage Properties.

Saying That Confidence Intervals Do Not Represent Degrees of Belief Is Not a Criticism of Frequentist Inference.
Saying That Posterior Intervals Do Not Have Frequency Coverage Properties Is Not a Criticism of Bayesian Inference.

Some Scientists Misinterpret Confidence Intervals as Degrees of Belief.
They Also Misinterpret Bayesian Intervals as Confidence Intervals.

Mindless Frequentist Statistical Analysis is Harmful to Science.
Mindless Bayesian Statistical Analysis is Harmful to Science.

2. The Definition of Bayesian and Frequentist Inference

Here are my definitions. You may have different definitions. But I am confident that my definitions correspond to the traditional definitions used in statistics for decades.

But first, I should say again that Bayesian and Frequentist inference are defined by their goals, not their methods.

The Goal of Frequentist Inference: Construct procedures with frequency guarantees. (For example, confidence intervals.)

The Goal of Bayesian Inference: Quantify and manipulate your degrees of belief. In other words, Bayesian inference is the Analysis of Beliefs.

(I think I got the phrase, “Analysis of Beliefs” from Michael Goldstein.)

My point is that “using Bayes’ theorem” is neither necessary nor sufficient for defining Bayesian inference. A frequentist analysis could certainly include the use of Bayes’ theorem. And conversely, it is possible to do Bayesian inference without using Bayes’ theorem (as Michael Goldstein, for example, has shown). Let me summarize this point in a table:

Fairly soon I am going to post a review of Nate Silver’s new book. (Short review: great book. Buy it and read it.) As I will discuss in that review, Nate argues forcefully that Bayesian analysis is superior to Frequentist analysis. But then he spends most of the book assessing predictions by how good their frequency properties are. For example, he says that a weather forecaster is good if it rains 95 percent of the times he says there is a 95 percent chance of rain. In other words, he loves to use Bayes’ theorem but his goals are overtly frequentist. I’ll say more about this in my review of his book. I use it here as an example of how one can be a user of Bayes’ theorem and still have frequentist goals.
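The forecaster criterion above is a frequency property, and it can be checked directly from a record of announced probabilities and outcomes. Here is a minimal Python sketch of such a check. The function names, the three forecast levels, and the simulated forecaster are all assumptions made for illustration; nothing here comes from Nate Silver's book.

```python
import random
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Group outcomes by announced probability.

    Returns {announced p: (number of days, observed frequency of rain)}.
    """
    buckets = defaultdict(list)
    for p, rained in zip(forecasts, outcomes):
        buckets[p].append(rained)
    return {p: (len(v), sum(v) / len(v)) for p, v in sorted(buckets.items())}

def simulate_forecaster(n_days, levels, rng):
    """Hypothetical well-calibrated forecaster: announces p, and it rains with probability p."""
    forecasts = [rng.choice(levels) for _ in range(n_days)]
    outcomes = [1 if rng.random() < p else 0 for p in forecasts]
    return forecasts, outcomes

rng = random.Random(0)
forecasts, outcomes = simulate_forecaster(20000, [0.05, 0.5, 0.95], rng)
for p, (count, freq) in calibration_table(forecasts, outcomes).items():
    print(f"announced {p:.2f}: {count} days, rained on a fraction {freq:.3f} of them")
```

The table says nothing about how the forecaster arrived at the announced probabilities. Bayes' theorem may or may not have been involved; the assessment itself is a statement about long-run frequencies.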

3. Coverage

An example of a frequency guarantee is coverage. Let {\theta = T(P)} be a function of a distribution {P}. Let {{\cal P}} be a set of distributions. Let {X_1,\ldots, X_n \sim P} be a sample from some {P\in {\cal P}}. Finally, let {C_n = C(X_1,\ldots,X_n)} be a set-valued mapping. Then {C_n} has coverage {1-\alpha} if

\displaystyle  \inf_{P\in {\cal P}}P^n( T(P) \in C_n) \geq 1-\alpha

where {P^n} is the {n}-fold product measure defined by {P}.

We say that {C_n} is a {1-\alpha} confidence set if it has coverage {1-\alpha}. A Bayesian {1-\alpha} posterior set will not (in general) have coverage {1-\alpha}. This is not a criticism of Bayesian inference, although anytime I mention this point, some people seem to take it that way. Bayesian inference is about the Analysis of Beliefs; it makes no claims about coverage.

I think there would be much less disagreement and confusion if we used different symbols for frequency probabilities and degree-of-belief probabilities. For example, suppose we used {{\sf Fr}} for frequentist statements and {{\sf Bel}} for degree-of-belief statements. Then the fact that coverage and posterior probability are different would be written

\displaystyle  {\sf Fr}_\theta(\theta\in C_n) \neq {\sf Bel}(\theta \in C_n|X_1,\ldots,X_n).

Unfortunately, we use the same symbol {P} for both, in which case the above statement becomes

\displaystyle  P_\theta(\theta\in C_n) \neq P(\theta \in C_n|X_1,\ldots,X_n)

which, I think, just makes things confusing.

Of course, there are cases where Bayes and Frequentist methods agree, or at least, agree approximately. But that should not lull us into ignoring the conceptual differences.
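One standard case of such agreement can be worked out explicitly. This is a sketch; the prior variance {\tau^2} is a symbol introduced only for this illustration. Suppose {X_1,\ldots, X_n \sim N(\theta,1)} and the prior is {\theta \sim N(0,\tau^2)}. Then the posterior is

\displaystyle  \theta \mid X_1,\ldots,X_n \sim N\left( \frac{n\overline{X}_n}{n + 1/\tau^2},\ \frac{1}{n+1/\tau^2}\right)

so the equi-tailed {1-\alpha} posterior interval is

\displaystyle  \frac{n\overline{X}_n}{n + 1/\tau^2} \pm \frac{z_{\alpha/2}}{\sqrt{n+1/\tau^2}}.

As {\tau\rightarrow\infty} this tends to {\overline{X}_n \pm z_{\alpha/2}/\sqrt{n}}, which is the usual {1-\alpha} confidence interval. The two intervals then coincide numerically, yet one is a statement about degrees of belief and the other is a statement about frequency coverage.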

4. Examples

Here are a couple of simple examples.

Example 1. Let {X_1,\ldots, X_n \sim N(\theta,1)\equiv P_\theta} and suppose our prior is {\theta \sim N(0,1)}. Let {B_n} be the equi-tailed 95 percent Bayesian posterior interval. Here is a plot of the frequentist coverage {{\sf Cov}_\theta =P_\theta(\theta\in B_n)} as a function of {\theta}. Note that {{\sf Cov}_\theta} is the frequentist probability that the random interval {B_n} traps {\theta}. ({B_n} is random because it is a function of {X_1,\ldots, X_n}.) Also plotted is the coverage of the usual confidence interval {C_n=[\overline{X}_n - z_{\alpha/2}/\sqrt{n},\ \overline{X}_n + z_{\alpha/2}/\sqrt{n}]}. This is a constant function, equal to 0.95 for every {\theta}.

Of course, the coverage {{\sf Cov}_\theta} of {B_n} is sometimes higher than {1-\alpha} and sometimes lower. The overall coverage is {\inf_\theta {\sf Cov}_\theta =0} because {{\sf Cov}_\theta} tends to {0} as {|\theta|} increases. At the risk of being very repetitive, this is not meant as a criticism of Bayes. I am just trying to make the difference clear.
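The coverage curve in Example 1 has a closed form, which makes it easy to reproduce. With the {N(0,1)} prior, the posterior is {N(n\overline{X}_n/(n+1),\ 1/(n+1))}, so {B_n = n\overline{X}_n/(n+1) \pm z_{\alpha/2}/\sqrt{n+1}}. Since {\overline{X}_n \sim N(\theta,1/n)} under {P_\theta}, a little algebra gives {{\sf Cov}_\theta = \Phi((\theta + z_{\alpha/2}\sqrt{n+1})/\sqrt{n}) - \Phi((\theta - z_{\alpha/2}\sqrt{n+1})/\sqrt{n})}. The Python sketch below evaluates this and cross-checks it by simulation. The function names, the choice {n=10}, and the grid of {\theta} values are assumptions for illustration and are not taken from the original plot.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse

def posterior_interval_coverage(theta, n, alpha=0.05):
    """Exact frequentist coverage of the equi-tailed posterior interval B_n
    when X_i ~ N(theta, 1) and the prior is theta ~ N(0, 1)."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    half = z * sqrt(n + 1)
    return STD_NORMAL.cdf((theta + half) / sqrt(n)) - STD_NORMAL.cdf((theta - half) / sqrt(n))

def simulated_coverage(theta, n, alpha=0.05, reps=20000, seed=0):
    """Monte Carlo coverage of B_n and of the usual interval C_n at a fixed theta."""
    rng = random.Random(seed)
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    hits_bayes = hits_freq = 0
    for _ in range(reps):
        xbar = rng.gauss(theta, 1 / sqrt(n))      # the sample mean is N(theta, 1/n)
        post_mean = n * xbar / (n + 1)            # posterior mean under the N(0,1) prior
        hits_bayes += abs(post_mean - theta) <= z / sqrt(n + 1)
        hits_freq += abs(xbar - theta) <= z / sqrt(n)
    return hits_bayes / reps, hits_freq / reps

n = 10
for theta in [0, 1, 2, 3, 5, 10]:
    exact = posterior_interval_coverage(theta, n)
    print(f"theta = {theta:>2}: coverage of B_n = {exact:.4f}, coverage of C_n = 0.9500")
```

The coverage of {B_n} depends on {\theta} only through the shift of the posterior mean toward the prior mean, which is why it collapses as {|\theta|} grows while the coverage of {C_n} stays flat.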

Example 2. A {1-\alpha} distribution-free confidence interval {C_n} for the median {\theta} of a distribution {P} can be constructed as follows. (This is a standard construction that can be found in any text.) Let {Y_1,\ldots, Y_n \sim P}. Let

\displaystyle  Y_{(1)} \leq Y_{(2)} \leq \cdots \leq Y_{(n)}

denote the order statistics (the ordered values). Choose {k} such that {P(k < B < n-k)\geq 1-\alpha} where {B\sim {\rm Binomial}(n,1/2)}. The confidence interval is {C_n = [Y_{(k+1)},Y_{(n-k)}]}. It is easily shown that

\displaystyle  \inf_P P^n(\theta \in C_n) \geq 1-\alpha

where the infimum is over all distributions {P}. So {C_n} is a {1-\alpha} confidence interval. Here is a plot showing some simulations I did:

The plot shows the first 50 simulations. In the first simulation I picked some distribution {F_1}. Let {\theta_1} be the median of {F_1}. I generated {n=100} observations from {F_1} and then constructed the interval. The confidence interval is the first vertical line. The true value is the dot. For the second simulation, I chose a different distribution {F_2}. Then I generated the data and constructed the interval. I did this many times, each time using a different distribution with a different true median. The blue interval shows the one time that the confidence interval did not trap the median. I did this 10,000 times (only 50 are shown). The interval covered the true value 94.33 percent of the time. I wanted to show this plot because, when some texts show confidence interval simulations like this, they use the same distribution for each trial. This is unnecessary and it gives the false impression that you need to repeat the same experiment in order to discuss coverage.
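A simulation in the spirit of the one described above can be sketched in a few lines of Python. This is not the code behind the plot. The family of distributions, the rule of taking the largest admissible {k}, the number of repetitions, and all function names are assumptions made for illustration.

```python
import random
from math import comb, log

def central_binomial_prob(n, k):
    """P(k < B < n - k) for B ~ Binomial(n, 1/2)."""
    return sum(comb(n, b) for b in range(k + 1, n - k)) / 2 ** n

def choose_k(n, alpha=0.05):
    """Largest k with P(k < B < n - k) >= 1 - alpha (assumed rule: shortest valid interval)."""
    if central_binomial_prob(n, 0) < 1 - alpha:
        raise ValueError("n is too small for this confidence level")
    k = 0
    while central_binomial_prob(n, k + 1) >= 1 - alpha:
        k += 1
    return k

def median_interval(y, k):
    """C_n = [Y_(k+1), Y_(n-k)], written with 0-based indexing."""
    ys = sorted(y)
    return ys[k], ys[len(ys) - k - 1]

def random_distribution(rng):
    """Pick a hypothetical distribution; return (sampler, true median)."""
    kind = rng.choice(["normal", "exponential", "uniform"])
    shift, scale = rng.uniform(-5, 5), rng.uniform(0.5, 3)
    if kind == "normal":
        return (lambda: rng.gauss(shift, scale)), shift
    if kind == "exponential":
        # A shifted, scaled Exponential(1) has median shift + scale * log(2).
        return (lambda: shift + scale * rng.expovariate(1.0)), shift + scale * log(2)
    return (lambda: rng.uniform(shift - scale, shift + scale)), shift

def simulate_coverage(n=100, alpha=0.05, reps=2000, seed=0):
    """Fraction of trials in which C_n traps the median, with a new distribution each trial."""
    rng = random.Random(seed)
    k = choose_k(n, alpha)
    hits = 0
    for _ in range(reps):
        sampler, true_median = random_distribution(rng)
        lower, upper = median_interval([sampler() for _ in range(n)], k)
        hits += lower <= true_median <= upper
    return hits / reps

estimated = simulate_coverage()
print(f"k = {choose_k(100)}, estimated coverage = {estimated:.4f}")
```

Each trial draws from a different distribution with a different median, yet the fraction of intervals that trap the truth is governed only by the Binomial calculation that fixed {k}.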

How would a Bayesian analyze this problem? The Bayesian analysis of this problem would start with a prior {\pi(P)} on the distribution {P}. This defines a posterior {\pi(P|Y_1,\ldots, Y_n)}. (But the posterior is not obtained via Bayes’ theorem! There is no dominating measure here. Nonetheless, there is still a well-defined posterior. But that’s a technical point we can discuss another day.) The posterior {\pi(P|Y_1,\ldots, Y_n)} induces a posterior {\pi(\theta|Y_1,\ldots, Y_n)} for the median. And from this we can get a 95 percent Bayesian interval {B_n}, say, for the median. The interval {B_n}, of course, depends on the prior {\pi}. I’d love to include a numerical experiment to compare {B_n} and {C_n} but time does not permit. It will make a good homework exercise in a course.
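A possible starting point for that exercise, under one specific and assumed choice of prior: the Bayesian bootstrap (Rubin, 1981), which can be viewed as the limit of a Dirichlet process prior on {P} as its concentration parameter tends to zero. Under it, a posterior draw of {P} places Dirichlet(1,...,1) weights on the observed points, and each draw has a weighted median. The Python sketch below collects those medians and reports an equi-tailed interval. The function names, the simulated sample, and the number of posterior draws are illustrative assumptions, and a different prior {\pi} would give a different {B_n}.

```python
import random

def weighted_median(values_sorted, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    half = sum(weights) / 2
    running = 0.0
    for value, w in zip(values_sorted, weights):
        running += w
        if running >= half:
            return value
    return values_sorted[-1]

def bayesian_bootstrap_median_interval(y, level=0.95, draws=4000, seed=0):
    """Equi-tailed posterior interval for the median under the Bayesian bootstrap."""
    rng = random.Random(seed)
    ys = sorted(y)
    medians = []
    for _ in range(draws):
        # Independent Exponential(1) variables are unnormalized Dirichlet(1,...,1) weights;
        # the weighted median does not depend on the normalization.
        weights = [rng.expovariate(1.0) for _ in ys]
        medians.append(weighted_median(ys, weights))
    medians.sort()
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return medians[lo_idx], medians[hi_idx]

data_rng = random.Random(1)
data = [data_rng.gauss(0, 1) for _ in range(100)]   # hypothetical sample with median 0
lower, upper = bayesian_bootstrap_median_interval(data)
print(f"95 percent posterior interval for the median: [{lower:.3f}, {upper:.3f}]")
```

Repeating this over many simulated samples and recording how often the interval traps the true median would estimate the frequentist coverage of this particular {B_n}, which is a separate question from whether it represents anyone's beliefs.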

5. Grey Area

There is much grey area between the two definitions I gave. I suspect, for example, that Andrew Gelman would deny being bound by either of the definitions I gave. That’s fine. But I still think it is useful to have clear, if somewhat narrow, definitions to begin with.

6. Identity Statistics

One thing that has harmed statistics, and harmed science, is identity statistics. By this I mean that some people identify themselves as “Bayesians” or “Frequentists.” Once you attach a label to yourself, you have painted yourself into a corner.

When I was a student, I took a seminar course from Art Dempster. He was the one who suggested to me that it was silly to describe a person as being Bayesian or Frequentist. Instead, he suggested that we describe a particular data analysis as being Bayesian or Frequentist. But we shouldn’t label a person that way.

I think Art’s advice was very wise.

7. Failures of Assumptions

I have had several people make comments like: “95 percent intervals don’t contain the true value 95 percent of the time.” Here is what I think they mean. When we construct a confidence interval {C_n} we inevitably need to make some assumptions. For example, we might assume that the data are iid. In practice, these assumptions might fail to hold in which case the confidence interval will not have its advertised coverage. This is true but I think this obscures the discussion.

Both Bayesian and Frequentist inference can fail to achieve their stated goals for a variety of reasons. Failures of assumptions are of great practical importance but they are not criticisms of the methods themselves.

Suppose you apply special relativity to predict the position of a satellite and your prediction is wrong because some of the assumptions you made don’t hold. That’s not a valid criticism of special relativity.
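The iid example mentioned earlier in this section can be made concrete with a small simulation. The Python sketch below applies the interval {\overline{X}_n \pm z_{\alpha/2}/\sqrt{n}}, which is valid for iid {N(\theta,1)} data, to data that are instead serially correlated. The AR(1) model, the value {\rho = 0.5}, the sample size, and the function names are all assumptions chosen for illustration.

```python
import random
from math import sqrt

Z_975 = 1.959964  # 0.975 quantile of the standard normal

def ar1_sample(n, rho, theta, rng):
    """Stationary AR(1) series around theta with marginal variance 1.
    With rho = 0 this is an iid N(theta, 1) sample."""
    innovation_sd = sqrt(1 - rho ** 2)
    x = rng.gauss(0, 1)                 # start in the stationary distribution
    out = []
    for _ in range(n):
        out.append(theta + x)
        x = rho * x + rng.gauss(0, innovation_sd)
    return out

def coverage(rho, n=50, theta=0.0, reps=4000, seed=0):
    """Fraction of samples in which the iid-based interval traps theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = sum(ar1_sample(n, rho, theta, rng)) / n
        hits += abs(xbar - theta) <= Z_975 / sqrt(n)
    return hits / reps

cov_iid = coverage(rho=0.0)
cov_dependent = coverage(rho=0.5)
print(f"iid data:       coverage = {cov_iid:.3f}")
print(f"dependent data: coverage = {cov_dependent:.3f}")
```

The drop in coverage for the dependent data reflects a failed assumption. The interval still does what it promises whenever the iid assumption holds.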

8. No True Value

Some people like to say that it is meaningless to discuss the “true value of a parameter.” No problem. We could conduct this entire conversation in terms of predicting observable random variables instead. This would not change my main points.

9. Conclusion

I’ll close by repeating what I wrote at the beginning: Frequentist inference is great for doing frequentist inference. Bayesian inference is great for doing Bayesian inference. They are both useful tools. The danger is confusing them.

10. Coming Soon On This Blog!

Future posts will include:

-A guest post by Ryan Tibshirani

-A guest post by Sivaraman Balakrishnan

-My review of Nate Silver’s book

-When Does the Bootstrap Work?

-Matrix-Fu, that deadly combination of Matrix Calculus and Kung-Fu.
