“Otto eye-balled the diva lying comatose amongst the reeds, and he suddenly felt the fire of inspiration flood his soul. He ran back to his workshop where he futzed and futzed and futzed.” –Bette Midler
Andrew was annoyed. Well, annoyed is probably too strong a word. Maybe a better way to start is with The List. When Andrew, Aki, and I work together we have The List of projects that need to be done and not every item on this list weighted the same by all of us.
The List has longer term ideas that we’ve been slouching towards, projects that have stalled, small ideas that have room to blossom, and then there’s my least favourite part. My least favourite part of The List is things that are finished but haven’t been written up as papers yet. This is my least favourite category because I never think coercing something into a paper is going to be any fun.
Regular readers of my blogs probably realize that I am frequently and persistently wrong.
But anyway. It was day one of one of Aki and my visits to Columbia and we were going through The List. Andrew was pointing out a project that had been sitting on The List for a very very long time. (Possibly since before I was party to The List.) And he wanted it off The List.
(Side note: this is the way all of our projects happen. Someone suddenly wants it done enough that it happens. Otherwise it stays marinating on The List.)
So let’s start again.
Andrew wanted a half-finished paper off The List and he had for a while. Specifically, the half-finished paper documenting the actual way that the Stan project computes (aka the Potential Scale Reduction Factor or, against our preference of not naming things after people, the Gelman-Rubin(-Brooks) statistic). So we agreed to finish it and then moved on to some more exciting stuff.
What is R-hat?
R-hat, or the potential scale reduction factor, is a diagnostic that attempts to measure whether or not an MCMC algorithm1 has converged. The basic idea is that you want to check a couple of things:
- Is the distribution of the first part of a chain (after warm up) the same as the distribution of the second half of the chain?
- If I start the algorithm at two different places and let the chain warm up, do both chains have the same distribution?
Historically, there was a whole lot of chat about whether or not you need to run multiple chains to compute R-hat. To summarize that extremely long conversation: you do. Why? To paraphrase the great statistician Vanessa Williams:
Sometimes the snow comes down in June
Sometimes the sun goes ’round the moon
Sometimes the posterior is multimodal
Sometimes the adaptation you do during warm up is unstable
Also it’s 2019 and all of our processors are multithreaded so just do it.
The procedure, which is summarized in the paper (and was introduced in BDA3), computes a single number summary and we typically say our Markov Chains have converged if .
A few things before we tear the whole thing down.
- Converging to the stationary distribution is the minimum condition required for a MCMC algorithm to be useful. R-hat being small doesn’t mean the chain is mixing well, so you need to check the effective sample size!
- R-hat is a diagnostic and not a proof of convergence. You still need to look at all of the other things (like divergences and BFMI in Stan) as well as diagnostic plots (more of which are in the paper)
- The formula for R-hat in BDA3 assumes that the stationary distribution has finite variance. This is a very hard property to check from a finite sample.
The third point is how Aki broke R-hat.
When does R-hat break?
The thing about turning something into a paper is that you need to have a better results section than you typically need for other purposes. So Aki went off and did some quick simulations and something bad happened. When he simulated from four chains where one had the wrong variance, the R-hat value was still near one. So R-hat was not noticing the variance was wrong. (This is the top row of the above figure. The left column has one bad chain, the right column four correct chains.)
On the other hand, R-hat was totally ok noticing when the location parameter of one chain had the wrong location parameter. Except for the bottom row of the figure, where the target distribution is Cauchy.
So we noticed two things:
- The existing R-hat diagnostic was only sensitive to errors in the first moment.
- The existing R-hat diagnostic failed catastrophically when the variance was infinite.
(Why “catastrophically”? Because it always says the chain is good!)
So clearly we could no longer just add two nice examples to the text that was written and then send the paper off. So we ran back to our workshop where we futzed and futzed and futzed.
He tried some string and paper clips…
Well, we2 came up with some truly terrible ideas but eventually circled around to two observations:
- Folding the draws by computing and computing R-hat on the folded chain will give a statistic that is sensitive to changes in scale.
- Rank-based methods are robust against fat tails. So perhaps we could rank-normalize the chain (ie compute the rank for each draw inside the total pool of samples and replace the true value with the quantile of a standard normal that corresponds to the rank).
Putting these two together, we get our new R-hat value: after rank-normalizing the chains, compute the standard R-hat and the folded R-hat and report the maximum of the two values. These are the blue histograms in the picture.
There are two advantages of doing this:
- The new value of R-hat is robust against heavy tails and is sensitive to changes in scale between the chains.
- The new value of R-hat is parameterization invariant, which is to say that the R-hat value for and will be the same. This was not a property of the original formulation.
What does the rank-normalization actually do?
Great question imaginary bold-face font person! The intuitive answer is that it is computing the R-hat value for the nicest transformation of the parameter. (Where nicest is problematically defined to be “most normal”). So what does this R-hat tell you? It tells you that if the MCMC algorithm has converged after we strip away all of the problems with heavy tails and skewness and all that jazz. Similarly we can compute an Effective Sample Size (ESS) for this nice scenario. This is the best case scenario R-hat and ESS. If it isn’t good, we have no hope.
Assuming the rank-normalized and folded R-hat is good and the rank-normalized ESS is good, it is worth investigating the chain further.
R-hat does not give us all the information we need to assess if the chain is useful
The old version of R-hat basically told us if the mean was ok. The new version tells us if the median and the MAD are ok. But that’s not the only thing we usually report. Typically a posterior is summarized by a measure of centrality (like the median) and a quantile-based uncertainty interval (such as the 0.025 and 0.975 quantiles of the posterior). We need to check that these quantiles are computed correctly!
This is not trivial: MCMC algorithms do not explore the tails as well as the bulk. This means that the Monte Carlo error in the quantiles may potentially be much higher than the Monte Carlo error in the mean. To deal with this, we introduced a localized measure of quantile efficiency, which is basically an effective sample size for computing a quantile. Here’s an example from the online appendix, where the problem is sampling from a Cauchy distribution using the “nominal” parameterization. You can see that it’s possible that central quantiles are being resolved well, but extreme quantile estimates will be very noisy.
Maybe it’s time to let traceplots die
As we are on somewhat of a visualization kick around these parts, let’s talk about traceplots of MCMC algorithms. They’re terrible. If the chain is long, all the interesting information is compressed, and if you try to include information from multiple chains it just becomes a mess. So let us propose an alternative: rank plots.
The idea is that if the chains are all mixing well exploring the same distribution, the ranks should be uniformly distributed. In the following figure, 4 chains of 1000 draws are plotted and you can easily see that the histograms are not uniform. Moreover, the histogram for the first chain clearly never visits the left tail of the distribution, which is indicative of a funnel. This would be harder to see with 4 traceplots plotted over each other.
How to use these new tools in practice
To close off, here are our recommendations (taken directly from the paper) for using R-hat. All of the methods in this paper will make their way into future version of RStan, rstanarm, and bayesplot (as well as all the other places we put things).
In this section [of the paper] we lay out practical recommendations for using the tools developed in this paper. In the interest of specificity, we have provided numerical targets for both R-hat and effective sample size (ESS). However, these values should be adapted as necessary for the given application.
In Section 4, we propose modifications to R-hat based on rank-normalizing and folding the posterior draws. We recommend running at least four chains by default and only using the sample if R-hat < 1.01. This is a much tighter threshold than the one recommended by Gelman and Rubin (1992), reflecting lessons learnt over more than 25 years of use.
Roughly speaking, the ESS of a quantity of interest captures how many independent draws contain the same amount of information as the dependent sample obtained by the MCMC algorithm. Clearly, the higher the ESS the better. When there might be difficulties with mixing, it is important to use between-chain information in computing the ESS. For instance, in the sorts of funnel-shaped distributions that arise with hierarchical models, differences in step size adaptation can lead to chains to have different behavior in the neighborhood of the narrow part of the funnel. For multimodal distributions with well-separated modes, the split-R-hat adjustment leads to an ESS estimate that is close to the number of distinct modes that are found. In this situation, ESS can be drastically overestimated if computed from a single chain.
As Vats and Knudson (2018) note, a small value of R-hat is not enough to ensure that an MCMC procedure is useful in practice. It also needs to have a sufficiently large effective sample size. As with R-hat, we recommend computing the ESS on the rank-normalized sample. This does not directly compute the ESS relevant for computing the mean of the parameter, but instead computes a quantity that is well defined even if the chains do not have finite mean or variance. Specifically, it computes the ESS of a sample from a normalized version of the quantity of interest, using the rank transformation followed by the normal inverse-cdf. This is still indicative of the effective sample size for computing an average, and if it is low the computed expectations are unlikely to be good approximations to the actual target expectations. We recommend requiring that the rank-normalized ESS is greater than 400. When running four chains, this corresponds to having a rank-normalized effective sample size of at least 50 per split chain.
Only when the rank-normalized and folded R-hat values are less than the prescribed threshold and the rank- normalized ESS is greater than 400 do we recommend computing the actual (not rank-normalized) effective sample size for the quantity of interest. This can then be used to assess the Monte Carlo standard error (MCSE) for the quantity of interest.
Finally, if you plan to report quantile estimates or posterior intervals, we strongly suggest assessing the convergence of the chains for these quantiles. In Section 4.3 we show that convergence of Markov chains is not uniform across the parameter space and propose diagnostics and effective sample sizes specifically for extreme quantiles. This is different from the standard ESS estimate (which we refer to as the “bulk-ESS”), which mainly assesses how well the centre of the distribution is resolved. Instead, these “tail-ESS” measures allow the user to estimate the MCSE for interval estimates.
1 R-hat can be used more generally for any iterative simulation algorithm that targets a stationary distribution, but it’s main use case is MCMC.
2 I. (Other people’s ideas were good. Mine were not.)