I am the supercargo

In a form of sympathetic magic, many built life-size replicas of airplanes out of straw and cut new military-style landing strips out of the jungle, hoping to attract more airplanes. – Wikipedia

Twenty years ago, Geri Halliwell left the Spice Girls, so I’ve been thinking about Cargo Cults a lot.

As an analogy for what I’m gonna talk about, it’s … inapt, but only if you’ve looked up Cargo Cults. But I’m going with it because it’s pride week and Drag Race is about to start.

The thing is, it can be hard to identify if you’re a member of a Cargo Cult. The whole point is that from within the cult, everything seems good and sensible and natural. Or, to quote today’s titular song,

They say “our John Frum’s coming,
He’s bringing cargo…” and the rest
At least they don’t expect to be
Surviving their own deaths.

This has been on my mind on and off for a while now. Mostly from a discussion I had with someone in the distant-enough-to-not-actually-remember-who-I-was-talking-to past, where we were arguing about something (I’m gonna guess non-informative vs informative priors, but honestly I do not remember) and this person suggested that the thing I didn’t like was a good idea, at least in part, because Harold Jeffreys thought it was a good idea.

A technical book written in the 1930s being used as a coup de grâce to end a technical argument in 2018 screams cargo cult to me. But is that fair? (Extreme narrator voice: It is not fair.)

I guess this is one of the problems we need to deal with as a field: how do we maintain the best bits of our old knowledge (pre-computation, early computation, MCMC, and now) while dealing with the rapidly evolving nature of modern data and modern statistical questions?

So how do you avoid cult-like behaviour? Well, as a child of Nü-Metal*, I think there’s only one real answer:

Break stuff

I am a firm believer that before you use a method, you should know how to break it. Describing how to break something should be an essential part of describing a new piece of statistical methodology (or, for that matter, of resurrecting an existing one). At the risk of getting all Dune on you, he who can destroy a thing controls a thing. 

(We’re getting very masc4masc here. Who’d’ve thought that me with a hangover was so into Sci-Fi & Nü-Metal? Next thing you know I’ll be doing a straight-faced reading of Ender’s Game. Look for me at 2am explaining to a woman who’d really rather not be still talking to me that they’re just called “buggers” because they look like bugs.)

So let’s break something.

This isn’t meant to last, this is for right now

Specifically let’s talk about breaking leave-one-out cross validation (LOO-CV) for computing the expected log-predictive density (elpd or sometimes LOO-elpd). Why? Well, partly because I also read that paper that Aki commented on a few weeks back that made me think more about the dangers of accidentally starting a cargo cult. (In this analogy, the cargo is an R package and a bunch of papers.)

One of the fabulous things about this job is that there are two things you really can’t control: how people will use the tools you construct, and how long they will continue to take advice that turned out not to be the best (for serious, cool it with the Cauchy priors!).

So it’s really important to clearly communicate flaws in a method both when it’s published and later on. This is, of course, in tension with the desire to actually get work published, so we do what we can.

Now, Aki’s response was basically definitive, so I’m mostly not going to talk about the paper. I’m just going to talk about LOO.

One step closer to the edge

One of the oldest criticisms of using LOO for model selection is that it is not necessarily consistent when the model list contains the true data generating model (the infamous, but essentially useless** M-Closed*** setting). This contrasts with model selection using Bayes’ Factors, which are consistent in the useless asymptotic regime. (Very into Nü-Metal. Very judgemental.)

Being that judge-y without explaining the context is probably not good practice, so let’s actually look at the famous case where model selection will not be consistent: Nested models.

For a very simple example, let’s consider two potential models:

\text{M1:}\; y_i \sim N(\mu, 1)

\text{M2:}\; y_i \sim N(\mu + \beta x_i, 1)

The covariate x_i can be anything, but for simplicity, let’s take it to be x_i \sim N(0,1).

And to put us in an M-Closed setting, let’s assume the data that we are seeing is drawn from the first model (M1) with \mu=0. In this situation, model selection based on the LOO-expected log predictive density will be inconsistent.
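If you want to poke at this yourself, here’s a minimal sketch of the setup in R. I’m using rstanarm with its default priors and letting it estimate the residual standard deviation rather than fixing it at 1; both of those are my shortcuts, not anything forced by the example.

# Sketch: simulate from M1 and compare the two nested models with PSIS-LOO.
library(rstanarm)
library(loo)

set.seed(1234)
n <- 200
x <- rnorm(n)               # covariate, x_i ~ N(0, 1)
y <- rnorm(n)               # data generated from M1 with mu = 0
d <- data.frame(x = x, y = y)

# M1: intercept only;  M2: intercept plus a slope on x
fit1 <- stan_glm(y ~ 1, data = d, family = gaussian(), refresh = 0)
fit2 <- stan_glm(y ~ x, data = d, family = gaussian(), refresh = 0)

# PSIS-LOO estimates of the expected log predictive density
loo1 <- loo(fit1)
loo2 <- loo(fit2)

# The elpd difference is typically small relative to its standard error,
# and it stays that way as n grows: LOO can't separate the two models in
# this nested, M-closed setting.
loo_compare(loo1, loo2)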

Spybreak!

To see this, we need to understand what the LOO methods are using to select models. It is the ability to predict a new data point coming from the (assumed iid) data generating mechanism. If two models asymptotically produce the same one point predictive distribution, then the LOO-elpd criterion will not be able to separate them.  This is different to Bayes’ factors, which will always choose the simplest of the models that make the same predictions.

Let’s look at what happens asymptotically. (And now you see why I focussed on such simple models: I’m quite bad at maths.)

Because these models are regular and have finite-dimensional parameters, they both satisfy all of the conditions of the Bernstein-von Mises theorem (which I once wrote about in these pages during an epic panic attack), which means that we know in both cases that the posterior for the model parameters \theta after observing n data points will be \theta_j^{(n)} = \theta_{(j)}^* + \mathcal{O}_p(n^{-1/2}). Here:

  • \theta_j^{(n)} is the random variable distributed according to the posterior for model j after n observations,
  • \theta_{(j)}^* is the true parameter from model j that would generate the data. In this case \theta_{(1)}^*=0 and \theta_{(2)}^*=(0,0)^T.
  • And \mathcal{O}_p(n^{-1/2}) is a random variable whose (finite) standard deviation goes to zero like n^{-1/2} as n increases.

Arguing loosely (again: quite bad at maths), the LOO-elpd criterion is trying to compute**** E_{\theta_j^{(n)}}\left[\log(p(y\mid\theta_j^{(n)}))\right], which asymptotically looks like \log(p(y\mid\theta_{(j)}^*))+\mathcal{O}_p(n^{-1/2}).
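To spell that out a little (LOO actually averages the log of the leave-one-out posterior predictive density, but the same Bernstein-von Mises expansion applies), for a new observation y,

\log p(y \mid y_{1:n}, M_j) = \log \int p(y\mid\theta)\,p(\theta\mid y_{1:n}, M_j)\,d\theta = \log p(y\mid\theta_{(j)}^*) + \mathcal{O}_p(n^{-1/2}),

and because \theta_{(1)}^*=0 and \theta_{(2)}^*=(0,0)^T give exactly the same density N(y\mid 0,1), the leading terms are identical.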

This means that, asymptotically, both of these models will give rise to the same posterior predictive distribution and hence LOO-elpd will not be able to tell between them.

Take a look around

LOO-elpd can’t tell them apart, but we sure can! The thing is, the argument of inconsistency in this case only really holds water if you never actually look at the parameter estimates. If you know that you have nested models (i.e. that one is a special case of the other), you should just look at the estimates to see if there’s any evidence for the more complex model. Or, if you want to do it more formally, consider the family of potential nested models as your M-Complete model class and use something like projpred to choose the simplest one.
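For the formal route, something like the following projpred sketch would do it. This continues from the hypothetical fit2 in the earlier sketch, and the function names come from projpred’s documented interface; the API has moved around between versions, so check the current docs.

# Projection predictive variable selection: search over submodels of the
# reference model fit2 and pick the smallest one that predicts about as
# well as the full thing.
library(projpred)

vs <- cv_varsel(fit2)   # cross-validated search over submodels

suggest_size(vs)        # suggested number of terms (here we'd hope for 0,
                        # i.e. the intercept-only model M1)
summary(vs)             # predictive performance along the search path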

All of which is to say that this inconsistency is mathematically a very real thing but should not cause practical problems unless you use model selection tools blindly and thoughtlessly.

For a bonus extra fact: This type of setup will also cause the stacking weights we (Yuling, Aki, Andrew, and I) proposed not to stabilize, because any convex combination of the two models will asymptotically give the same predictive distribution. So be careful if you’re trying to interpret model stacking weights as posterior model probabilities.
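If you want to watch that happen, the loo package will compute stacking (and pseudo-BMA+) weights from the same hypothetical LOO objects as before; a sketch:

# Stacking weights for M1 and M2 from the PSIS-LOO objects computed earlier.
# In this nested, M-closed example the weights need not settle down: any
# convex combination of the two models gives (asymptotically) the same
# predictive distribution, so re-simulating the data and refitting will
# move the weights around even at large n.
library(loo)
loo_model_weights(list(M1 = loo1, M2 = loo2), method = "stacking")
loo_model_weights(list(M1 = loo1, M2 = loo2), method = "pseudobma")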

Have a cigar

But I said I was going to break things. And so far I’ve just propped up the method yet again.

The thing is, there is a much bigger problem with LOO-elpd. The problem is the assumption that leaving one observation out is enough to get a good approximation to the average value of the posterior log-predictive over a new data set.  This is all fine when the data is iid draws from some model.

LOO-elpd can fail catastrophically and silently when the data cannot be assumed to be iid. A simple case where this happens is time-series data, where you should leave out the whole future instead.  Or spatial data, where you should leave out large-enough spatial regions that the point you are predicting is effectively independent of all of the points that remain in the data set. Or when your data has multilevel structure, where you really should leave out whole strata.

In all of these cases, cross validation can be a useful tool, but it’s k-fold cross validation that’s needed rather than LOO-CV. Moreover, if your data is weird, it can be hard to design a cross validation scheme that’s defensible. Worse still, while LOO is cheap (thanks to Aki and Jonah’s work on the loo package), k-fold CV requires re-fitting the model a lot of times, which can be extremely expensive.
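To make the multilevel case concrete, here’s a hedged sketch of grouped K-fold CV. The data frame d2, its grouping factor group, and the fitted model fit_ml are all hypothetical; the fold helper is from the loo package and the kfold() method from rstanarm.

# Grouped K-fold CV: whole groups (all observations from one stratum,
# site, person, ...) leave the data together, so the held-out points
# aren't propped up by their groupmates.
library(loo)

folds <- kfold_split_grouped(K = 10, x = d2$group)

# Refit the model K times, each time holding out every observation in
# the left-out groups. Expensive, but it answers the right question.
kf <- rstanarm::kfold(fit_ml, K = 10, folds = folds)
kf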

All of this is to say that if you want to avoid an accidental LOO cargo cult, you need to be very aware of the assumptions and limitations of the method and to use it wisely, rather than automatically. There is no such thing as an automatic statistician.

Notes:

* One of the most harrowing days of my childhood involved standing at the checkout of the Target in Buranda (a place that has not changed in 15 years, btw) and having to choose between buying the first Linkin Park album and the first Coldplay album. You’ll be pleased to know that I made the correct choice.

** When George Box said that “All models are wrong” he was saying that M-Closed is a useless assumption that is never fulfilled.

*** The three modelling scenarios (according to Bernardo and Smith):

  • M-closed means the true data generating model is one of the candidate models M_k, although which one is unknown to researchers.
  • M-complete refers to the situation where the true model exists (and we can specify the explicit form for it), but for some reason it is not in the list of candidate models.
  • M-open refers to the situation in which we know the true model is not in the set of candidate models and we cannot specify its explicit form (this is the most common one).

**** A later edit: I forgot the logarithms in the expected log-densities, because by the time I finished this a drag queen had started talking and I knew it was time to push publish and finish my drink.
