# You better check yo self before you wreck yo self

April 18, 2018
By

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

We (Sean Talts, Michael Betancourt, Me, Aki, and Andrew) just uploaded a paper (code available here) that outlines a framework for verifying that an algorithm for computing a posterior distribution has been implemented correctly. It is easy to use, straightforward to implement, and ready to be implemented as part of a Bayesian workflow.

This type of testing should be required in order to publish a new (or improved) algorithm that claims to compute a posterior distribution. It’s time to get serious about only publishing things that actually work!

You Oughta Know

Before I go into our method, let’s have a brief review of some things that are not sufficient to demonstrate that an algorithm for computing a posterior distribution actually works.

• Theoretical results that are anything less than demonstrably tight upper and lower bounds* that work in finite-sample situations.
• Comparison with a long run from another algorithm unless that algorithm has stronger guarantees than “we ran it for a long time”. (Even when the long-running algorithm is guaranteed to work, there is nothing generalizable here. This can only ever show the algorithm works on a specific data set.)
• Recovery of parameters from simulated data (this literally checks nothing)
• Running the algorithm on real data. (Again, this checks literally nothing.)
• Running the algorithm and plotting traceplots, autocorrelation, etc etc etc
• Computing the Gelman-Rubin R-hat statistic. (Even using multiple chains initialized at diverse points, this only checks if the Markov Chain has converged. It does not check that it’s converged to the correct thing)

I could go on and on and on.

The method that we are proposing does actually do a pretty good job at checking if an approximate posterior is similar to the correct one. It isn’t magic. It can’t guarantee that a method will work for any data set.

What it can do is make sure that for a given model specification, one dimensional posterior quantities of interest will be correct on average. Here, “on average” means that we average over data simulated from the model. This means that rather than just check the algorithm once when it’s proposed, we need to check the algorithm every time it’s used for a new type of problem. This places algorithm checking within the context of Bayesian Workflow.

This isn’t as weird as it seems. One of the things that we always need to check is that we are actually running the correct model. Programming errors happen to everyone and this procedure will help catch them.

Moreover, if you’re doing something sufficiently difficult, it can happen that even something as stable as Stan will quietly fail to get the correct result. The Stan developers have put a lot of work into trying to avoid these quiet cases of failure (Betancourt’s idea to monitor divergences really helped here!), but there is no way to user-proof software. The Simulation-Based Calibration procedure that we outline in the paper (and below) is another safety check that we can use to help us be confident that our inference is actually working as expected.

(* I will also take asymptotic bounds and sensitive finite sample heuristics because I’m not that greedy. But if I can’t run my problem, check the heuristic, and then be confident that if someone died because of my inference, it would have nothing to do with the computaition of the posterior, then it’s not enough.)

Don’t call it a comeback, I’ve been here for years

One of the weird things that I have noticed over the years is that it’s often necessary to re-visit good papers from the past so they reflect our new understanding of how statistics works.  In this case, we re-visited an excellent idea  Samantha Cook, Andrew, and Don Rubin proposed in 2006.

Cook, Gelman, and Rubin proposed a method for assessing output from software for computing posterior distributions by noting a simple fact:

If $\theta^* \sim p(\theta)$ and $y^* \sim p(y \mid \theta^*)$, then the posterior quantile $\Pr(h(\theta^*) is uniformly distributed  (the randomness is in $y^*$) for any continuous function $h(\cdot)$.

There’s a slight problem with this result.  It’s not actually applicable for sample-based inference! It only holds if, at every point, all the distributions are continuous and all of the quantiles are computed exactly.

In particular, if you compute the quantile $\Pr(h(\theta^*) using a bag of samples drawn from an MCMC algorithm, this result will not hold.

This makes it hard to use the original method in practice. That might be down-weighting the problem. This whole project happened because we wanted to run Cook, Gelman and Rubin’s procedure to compare some Stan and BUGS models. And we just kept running into problems. Even when we ran it on models that we knew worked, we were getting bad results.

So we (Sean, Michael, Aki, Andrew, and I) went through and tried to re-imagine their method as something that is more broadly applicable.

When in doubt, rank something

The key difference between our paper and Cook, Gelman, and Rubin is that we have avoided their mathematical pitfalls by re-casting their main theoretical result to something a bit more robust. In particular, we base our method around the following result.

Let $\theta^* \sim p(\theta)$ and $y^* \sim p(y \mid \theta^*)$, and $\theta_1,\ldots,\theta_L$ be independent draws from the posterior distribution $p(\theta\mid y^*)$. Then the rank of $h(\theta^*)$ in the bag of samples $h(\theta_1),\ldots,h(\theta_L)$ has a discrete uniform distribution on $[0,L]$.

This result is true for both discrete and continuous distributions. On the other hand, we now have freedom to choose $L$. As a rule, the larger $L$, the more sensitive this procedure will be. On the other hand, a larger $L$ will require more simulated data sets in order to be able to assess if the observed ranks deviate from a discrete-uniform distribution.   In the paper, we chose $L=100$ samples for each posterior.

The hills have eyes

But, more importantly, the hills have autocorrelation. If a posterior has been computed using an MCMC method, the bag of samples that are produced will likely have non-trivial autocorrelation. This autocorrelation will cause the rank histogram to deviate from uniformity in a specific way. In particular, it will lead to spikes in the histogram at zero and/or one.

To avoid this, we recommend thinning the sample to remove most of the autocorrelation.  In our experiments, we found that thinning by effective sample size was sufficient to remove the artifacts, even though this is not theoretically guaranteed to remove the autocorrelation.  We also considered using some more theoretically motivated methods, such as thinning based on Geyer’s initial positive sequences, but we found that these thinning rules were too conservative and this more aggressive thinning did not lead to better rank histograms than the simple effective sample size-based thinning.

Simulation based calibration

After putting all of this together, we get the simulation based calibration (SBC) algorithm.  The below version is for correlated samples. (There is a version in the paper for independent samples).

The simple idea is that each of the $N$ simulated datasets, you generate a bag of $L$ approximately independent samples from the approximate posterior. (You can do this however you want!) You then compute the rank of the true parameter (that was used in the simulation of the data set) within the bag of samples.  So you need to compute $N$ true parameters, each of which is used to compute one data set, which is used to compute $L$ samples from its posterior.

So. Validating code with SBC is obviously expensive. It requires a whole load of runs to make it work. The up side is that all of this runs in parallel on a cluster, so if your code is reliable, it is actually quite straightforward to run.

The place where we ran into some problems was trying to validate MCMC code that we knew didn’t work. In this case, the autocorrelation on the chain was so strong that it wasn’t reasonable to thin the chain to get 100 samples. This is an important point: if your method fails some basic checks, then it’s going to fail SBC. There’s no point wasting your time.

The main benefit of SBC is in validating MCMC methods that appear to work, or validating fast approximate algorithms like INLA (which works) or ADVI (which is a more mixed bag).

This method also has another interesting application: evaluating approximate models. For example, if you replace an intractable likelihood with a cheap approximation (such as a composite likelihood or a pseudolikelihood), SBC can give some idea of the errors that this approximation has pushed into the posterior. The procedure remains the same: simulate parameters from the prior, simulate data from the correct model, and then generate a bag of approximately uncorrelated samples from corresponding posterior using the approximate model. While this procedure cannot assess the quality of the approximation in the presence of model error, it will still be quite informative.

When You’re Smiling (The Whole World Smiles With You)

One of the most useful parts of the SBC procedure is that it is inherently visual. This makes it fairly straightforward to work out how your algorithm is wrong.  The one-dimensional rank histograms have four characteristic non-uniform shapes: “smiley”, “frowny”, “a step to the left”, “a jump to the right”, which are all interpretable.

• Histogram has a smile: The posteriors are narrower than they should be. (We see too many low and high ranks)
• Histogram has a frown: The posteriors are wider than they should be. (We don’t see enough low and high ranks)
• Histogram slopes from left to right: The posteriors are biased upwards. (The true value is too often in the lower ranks of the sample)
• Histogram slopes from right to left: The posteriors are biased downwards. (The opposite)

These histograms seem to be sensitive enough to indicate when an algorithm doesn’t work. In particular, we’ve observed that when the algorithm fails, these histograms are typically quite far from uniform. A key thing that we’ve had to assume, however, is that the bag of samples drawn from the computed posterior is approximately independent. Autocorrelation can cause spurious spikes at zero and/or one.

These interpretations are inspired by the literature on calibrating probabilistic forecasts. (Follow that link for a really detailed review and a lot of references).  There are also some multivariate extensions to these ideas, although we have not examined them here.

Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science

Tags: , ,

 Tweet

Email: