(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)
John Cook considers how people justify probability distribution assumptions:
Sometimes distribution assumptions are not justified.
Sometimes distributions can be derived from fundamental principles [or] . . . on theoretical grounds. For example, large samples and the central limit theorem together may justify assuming that something is normally distributed.
Often the choice of distribution is somewhat arbitrary, chosen by intuition or for convenience, and then empirically shown to work well enough.
Sometimes a distribution can be a bad fit and still work well, depending on what you’re asking of it.
Cook continues:
The last point is particularly interesting. It’s not hard to imagine that a poor fit would produce poor results. It’s surprising when a poor fit produces good results.
And then he gives an example of an effective but inaccurate model for survival times in a clinical trial. Cook explains:
The [poorly-fitting] method works well because of the question being asked. The method is not being asked to accurately model the distribution of survival times for patients in the trial. It is only being asked to determine whether a trial should continue or stop, and it does a good job of doing so. As the simulations in this paper show, the method makes the right decision with high probability, even when the actual survival times are not exponentially distributed.
This is an excellent point, and I’d like to elaborate by considering a different way in which a bad model can work well.
An example where a bad model works well because of its implicit assumptions
In Section 9.3 of Bayesian Data Analysis (second edition), we compare several different methods for estimating a population total from a random sample in an artificial problem in which the population is the set of all cities and towns in a state. The data are skewed—some cities have much more population than others—but if you use standard survey-sampling estimates and standard errors, you get OK inferences. The inferences are not perfect—in particular, the confidence interval can include negative values because the brute-force approach doesn’t “know” that the data (city populations) are all positive—but the intervals make sense and have reasonable coverage properties. In contrast, as Don Rubin showed when he first considered this example, comparable analyses applying the normal distribution to log or power-transformed data can give horrible answers.
What’s going on? How come the interval estimates based on these skewed data have reasonable coverage when we use the normal distribution, while inferences based on the much more sensible lognormal or power-transformed models are so disastrous?
A quick answer is that the normal-theory method makes implicit use of the central limit theorem, but then this just pushes the question back one step: Why should the central limit theorem apply here? Why indeed. The theorem applies for this finite sample (n=100, in this case) because, although the underlying distribution is skewed, there are no extreme outliers. By using the normal-based interval, we are implicitly assuming a reasonable upper bound in the population. And, in fact, if we put an upper bound into the power-transformed model, it works even better.
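To make the point about outliers concrete, here is a minimal simulation sketch. It is not the analysis from the book and does not use the city data: the population is synthetic (lognormal draws), and the population size of 800, the lognormal scale, and the "50 times the maximum" outlier are all assumptions chosen for illustration. Only the sample size n=100 comes from the example above. The sketch computes the standard normal-theory interval for a population total from a simple random sample and checks how often it covers the true total, first for a skewed population with no extreme outlier and then for the same population with one extreme unit added.

```python
import math
import random

def total_interval(sample, N, z=1.96):
    """Normal-theory estimate and interval for a population total,
    based on a simple random sample without replacement."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((y - mean) ** 2 for y in sample) / (n - 1)
    est = N * mean
    # Standard error of the estimated total, with finite-population correction.
    se = N * math.sqrt(var / n) * math.sqrt(1 - n / N)
    return est, est - z * se, est + z * se

def coverage(population, n=100, reps=2000, seed=0):
    """Fraction of repeated samples whose interval contains the true total."""
    rng = random.Random(seed)
    N = len(population)
    true_total = sum(population)
    hits = 0
    for _ in range(reps):
        sample = rng.sample(population, n)
        _, lo, hi = total_interval(sample, N)
        hits += lo <= true_total <= hi
    return hits / reps

def make_population(N=800, sigma=1.0, seed=1):
    """Hypothetical skewed population: N independent lognormal draws."""
    rng = random.Random(seed)
    return [rng.lognormvariate(0.0, sigma) for _ in range(N)]

# Skewed, but with no unit wildly larger than the rest.
bounded = make_population()

# Same population with one unit replaced by an extreme value (assumed factor of 50).
with_outlier = bounded[:-1] + [50 * max(bounded)]

cov_bounded = coverage(bounded)
cov_outlier = coverage(with_outlier)
print("coverage, no extreme outlier: ", cov_bounded)
print("coverage, one extreme outlier:", cov_outlier)
```

If the argument above is right, the first coverage should be reasonably close to the nominal 95% and the second much lower, because most samples miss the extreme unit and so contain no information about it. The normal-based interval only looks good when the implicit upper bound actually holds.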
The moral of the story? Sometimes an ill-fitting model works well because, although it doesn’t fit much of the data, it includes some assumption that is relevant to inferences, some aspect of the model that would be difficult to ascertain from the data alone. And, once we identify what that assumption is, we can put it directly into an otherwise better-fitting model and improve performance.
