Yes, checking calibration of probability forecasts is part of Bayesian statistics

December 6, 2012

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Yes, checking calibration of probability forecasts is part of Bayesian statistics. At the end of this post are three figures from Chapter 1 of Bayesian Data Analysis illustrating empirical evaluation of forecasts.

But first the background. Why am I bringing this up now? It’s because of something Larry Wasserman wrote the other day:

One of the striking facts about [baseball/political forecaster Nate Silver's recent] book is the emphasis Silver places on frequency calibration. . . . Have no doubt about it: Nate Silver is a frequentist. For example, he says:

One of the most important tests of a forecast — I would argue that it is the single most important one — is called calibration. Out of all the times you said there was a 40 percent chance of rain, how often did rain actually occur? If over the long run, it really did rain about 40 percent of the time, that means your forecasts were well calibrated.
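The calibration test Silver describes can be computed directly from a forecast record: group the stated probabilities into bins and compare each bin's stated probability to the observed frequency. Here is a minimal sketch, with simulated data standing in for a real forecast log (the forecasts and outcomes below are invented for illustration, not taken from Silver's book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical forecast record: each day a stated probability of rain,
# plus whether rain actually occurred. Simulated here as well calibrated.
p_forecast = rng.uniform(0, 1, size=10_000)
rained = rng.random(10_000) < p_forecast

# Bin the forecasts and compare each bin's stated probability
# with the empirical frequency of rain within that bin.
bins = np.linspace(0, 1, 11)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (p_forecast >= lo) & (p_forecast < hi)
    if mask.any():
        print(f"forecast {lo:.1f}-{hi:.1f}: observed {rained[mask].mean():.2f}")
```

For a well-calibrated forecaster, the "40 percent" bin should show rain roughly 40 percent of the time, and so on down the table.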

I had some discussion with Larry in the comments section of his blog and raised the following point: There is such a thing as Bayesian calibration of probability forecasts. If you are predicting a binary outcome y using a Bayesian prediction p.hat (where p.hat is the posterior expectation E(y|data)), then Bayesian calibration requires that E(y|p.hat) = p.hat for any p.hat. This isn't the whole story (as always, calibration matters but so does precision).

The last time I took (or taught) a theoretical statistics course was almost thirty years ago, but I recall that frequentist coverage is defined with the expectation taken conditional on the value of the unknown parameters theta in the model. The calibration Larry describes above (for another example, see here and scroll down) is unconditional on theta, and thus Bayesian.

I haven’t read Nate’s book so I’m not sure what he does. But I expect his calibration is Bayesian. Just about any purely data-based calibration will be Bayesian, as we never know theta.

Larry responded in the comments. I don't completely understand his reply, but I think he says that unconditional coverage calculations are frequentist also.

In that case, maybe we can divide up the coverage calculations as follows: Unconditional coverage (E(y) = E(p.hat)) is both a Bayesian and frequentist property. (For both modes of inference, unconditional coverage will occur if all the assumptions are true.) Coverage conditional on data (E(y|p.hat) = p.hat for any p.hat) is Bayesian. Nate is looking for Bayesian coverage. Coverage conditional on theta (E(y|theta) = E(p.hat|theta)) is frequentist.
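The conditional-on-data notion, E(y|p.hat) = p.hat, can be illustrated by simulating from the joint model: draw theta from the prior, data given theta, and then check whether the posterior-mean forecast is calibrated within bins of p.hat, averaging over theta rather than conditioning on it. The beta-binomial setup below is my own illustrative choice, not an example from the post:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_obs = 20_000, 10
a, b = 2.0, 2.0  # Beta prior parameters (an illustrative choice)

# Simulate from the joint model: theta from the prior, data given theta.
theta = rng.beta(a, b, size=n_sims)
y_obs = rng.binomial(n_obs, theta)   # observed successes in n_obs trials
y_next = rng.random(n_sims) < theta  # the binary outcome to be predicted

# Posterior-mean forecast for the next outcome: p.hat = E(y_next | y_obs).
p_hat = (a + y_obs) / (a + b + n_obs)

# Bayesian calibration: within each band of p.hat, the outcome should
# occur with frequency p.hat -- unconditional on theta.
for lo in np.arange(0.2, 0.8, 0.1):
    mask = (p_hat >= lo) & (p_hat < lo + 0.1)
    if mask.any():
        print(f"p_hat {lo:.1f}-{lo + 0.1:.1f}: "
              f"mean p_hat {p_hat[mask].mean():.3f}, "
              f"observed {y_next[mask].mean():.3f}")
```

The observed frequency tracks the forecast in every bin because the simulation uses the same prior the forecaster assumes; frequentist coverage conditional on a fixed theta would instead hold the first line fixed and average only over the data.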

What is different about Bayesian inference? In Bayesian inference you make more assumptions and then can make more claims (hence the Bayesian quote, “With great power comes great responsibility”). Frequentists such as Larry are wary (perhaps justifiably so) of making too many assumptions. They’d rather develop methods with good average coverage properties under minimal assumptions. These statistical methods have a long tradition, have solved many important applied problems, and are the subject of research right now. I have no problem with these non-Bayesian methods even if I do not often use them myself. When it comes to frequency evaluation, the point is that Bayesian inference is supposed to be calibrated conditional on any aspect of the data.

To return to the title of this post, yes, checking calibration of probability forecasts is part of Bayesian statistics. We have two examples of this calibration in the very first chapter of Bayesian Data Analysis. Calibration is also a central topic in treatments of Bayesian decision theory such as the books by J. Q. Smith and Bob Clemen. I think it’s fair enough to agree with Larry that these are frequency calculations. But they are Bayesian frequency calculations by virtue of being conditional on data, not on unknown parameters. For Bayesians such as myself (and, perhaps, for the tens of thousands of readers of our book), probabilities are empirical quantities, to be measured, modeled, and evaluated in prediction. I think this should make Larry happy, that frequency evaluation (albeit conditional on y, not theta) is central to modern Bayesian statistics.

Not f or b

Thus, it’s not Frequentist or Bayesian. Frequency evaluations are (or should be) important for all statisticians but they can be done in different ways. I think Nate’s doing them in the Bayesian way but I’ll accept Larry’s statement that Nate and I and other applied Bayesians are frequentists too (of the sort that perform our frequency evaluations conditional on observed data rather than unknown parameters). And I do see the conceptual (and, at times, practical) appeal of frequentist methods that allow fewer probability statements but make correspondingly fewer assumptions, even if I don’t usually go that way myself.

The calibrations from chapter 1

[Three figures from Chapter 1 of Bayesian Data Analysis, illustrating empirical evaluation of forecasts.] Chapter 1 of Bayesian Data Analysis. You can’t get much more Bayesian than that.
