revised empirical HMC

Following the informed and helpful comments from Matt Graham and Bob Carpenter on our eHMC paper [arXival] last month, we produced a revised and re-arXived version of the paper based on new experiments run by Changye Wu and Julien Stoehr. Here are some quick replies to these comments, reproduced for convenience. (Warning: this is a loooong post, much longer than usual.)

Matt:

It is argued in the paper that each NUTS integration step is twice as costly as in standard HMC, as it requires evaluating both the potential energy (negative log target density) and its gradient at each step rather than just the potential energy gradient. This, I think, is not generally true: in most applications of HMC the potential energy gradient will be calculated using reverse-mode automatic differentiation, which generally requires the original function (here the potential energy) to be calculated in a forward pass before the gradients are calculated in the backwards pass. Each gradient evaluation therefore also gives the original function value for ‘free’ – for example this seems to be the case in Stan, judging from the docstring of the Stan Math library `gradient` function.

Even if the gradients are manually coded, typically there will be a lot of shared computation between the original function and gradient calculations, so even with a more optimised implementation, evaluating the value and gradient of a (scalar) function together is unlikely to be twice as costly as evaluating just the gradient. For example, in the logistic regression example in the paper the potential energy function and gradient are defined [separately], however we could compute both together at a small additional overhead over the gradient computation cost […]

Correct, this is a well-made point and using the factor 2 in the original version was indeed pretty naïve. We have removed this argument from the presentation. Thanks!
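
To illustrate the point, here is a minimal sketch (hypothetical names, not the code accompanying the paper) of a joint evaluation for a Bernoulli-logit potential with a flat prior, in which the value rides along with the gradient at essentially no extra cost:

```r
# Illustrative sketch only (hypothetical names): the potential energy U(beta)
# and its gradient share the linear predictor X %*% beta, so returning both
# together costs only marginally more than the gradient alone.
U_and_grad <- function(beta, X, y) {
  eta <- as.vector(X %*% beta)               # shared forward computation
  p   <- plogis(eta)                         # logistic link, computed once
  U   <- -sum(y * eta - log1p(exp(eta)))     # negative log-likelihood
  g   <- -as.vector(crossprod(X, y - p))     # gradient reuses eta through p
  list(value = U, grad = g)
}
```

(A numerically stable version would replace log1p(exp(eta)) with pmax(eta, 0) + log1p(exp(-abs(eta))); the point is only that eta and p are shared between the two outputs.)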

Further, in the evaluations of run-time it is assumed the cost of each NUTS transition is `[number of leapfrog steps] * 2 + 1` and of a standard HMC transition `[number of leapfrog steps] + 2`. I’m assuming the additive constants are there to account for the potential energy evaluations at the final / initial states for the Metropolis accept step. However, in reality we will always have already calculated the gradient at the initial and final states in the process of running the integrator (or will calculate it in the next iteration). Since we can evaluate both gradient and value together for the same cost as just the gradient, if we always cache the potential energy values when we calculate the gradient (which Stan does, I think), then the number of additional potential energy and gradient evaluations per transition for both NUTS and standard HMC is just the number of leapfrog steps.

Similarly, this point is well made. The new version of the paper removes the superfluous additional factors.
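
For concreteness, here is a minimal sketch (assumed interface, not the paper’s implementation) of the corresponding bookkeeping, with U_grad a one-argument closure returning list(value, grad) as in the sketch above: since the evaluation at the starting point is cached from the previous transition, a trajectory of n_leapfrog steps costs exactly n_leapfrog joint evaluations.

```r
# Minimal sketch (assumed interface): leapfrog integration reusing the cached
# value-and-gradient at the current state, so each transition costs exactly
# n_leapfrog calls to U_grad().
leapfrog <- function(q, p, cached, U_grad, eps, n_leapfrog) {
  cur <- cached                              # list(value, grad) at q, already paid for
  for (l in seq_len(n_leapfrog)) {
    p   <- p - eps / 2 * cur$grad            # half step for momentum
    q   <- q + eps * p                       # full step for position
    cur <- U_grad(q)                         # one joint evaluation per leapfrog step
    p   <- p - eps / 2 * cur$grad            # second half step for momentum
  }
  list(q = q, p = p, cache = cur, n_eval = n_leapfrog)
}
```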

On a separate note, the implementations of the eHMC* methods in the accompanying code have a potentially incorrect treatment of the HMC transitions which produce `NaN` values for the Metropolis acceptance ratio. Rather than rejecting in these cases, it appears that the samplers ‘retry’: the iteration counter `i` is not incremented and the sampled state (i.e. equal to the previous value) is not added to the chain. I’m not sure this implementation defines a valid MCMC algorithm (as we are ‘downweighting’ the states at which such rejections should occur, though I may be wrong here). More importantly from the evaluation perspective, these failed transitions are also not included in the computational cost. Since, as implemented, the integrator still runs the full number of leapfrog steps even when a divergence at an earlier step causes `NaN` values to be generated, the computed costs do not reflect the actual number of gradient evaluations required. In Stan and other NUTS implementations such as PyMC3, such integrator divergences lead to early termination of the trajectory building and so save on the cost of continuing to integrate once we know we will reject. This could be implemented similarly for the eHMC methods, but even so the cost of such transitions would still be non-zero (such shortened trajectories are reflected in the Stan `n_leapfrog__` statistics used to calculate the NUTS computational cost in the code).

This was indeed a dangerous choice, as it could (should?) impact the stationary distribution when the number of not-a-number occurrences is large. The new version of Changye’s code on GitHub handles [we think and hope] the NaNs correctly: retry is replaced with reject.
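
A minimal sketch of the intended behaviour (hypothetical names, reusing the leapfrog sketch above): a non-finite Hamiltonian at the proposal triggers an ordinary rejection, the current state is repeated in the chain, and the wasted leapfrog steps are still charged to the computational cost.

```r
# Sketch of the corrected behaviour (hypothetical names): a NaN/Inf Hamiltonian
# leads to a plain rejection; the iteration stays in the chain and its leapfrog
# steps are still counted, instead of silently retrying the transition.
hmc_transition <- function(state, U_grad, eps, n_leapfrog) {
  p0   <- rnorm(length(state$q))                      # momentum refresh (identity mass)
  H0   <- state$cache$value + sum(p0^2) / 2
  traj <- leapfrog(state$q, p0, state$cache, U_grad, eps, n_leapfrog)
  H1   <- traj$cache$value + sum(traj$p^2) / 2
  accept <- is.finite(H1) && log(runif(1)) < H0 - H1  # NaN => reject, never retry
  out <- if (accept) list(q = traj$q, cache = traj$cache) else state[c("q", "cache")]
  out$n_eval <- traj$n_eval                            # cost counted either way
  out
}
```

In a full sampler loop the iteration counter then advances whether or not the proposal is accepted, so no state is silently dropped.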

In the conclusion it is claimed that NUTS samples uniformly from the trajectory of states generated, which, it is suggested, is part of the reason for its relatively poorer performance. Although this is the case for the ‘naive’ implementation of NUTS given in Algorithm 2 of Hoffman and Gelman (2014), which is used to motivate the initial discussion of the algorithm, in practice the more efficient implementation detailed in Algorithm 3 is used. Rather than sampling *independently* from the uniform distribution on the set of candidate states from the trajectory which are within the slice, this variant instead uses a Markov transition kernel which leaves this uniform distribution invariant while favouring moves towards states near the end-points of the trajectory. The NUTS implementation used in Stan (and I think PyMC3) actually no longer uses a slice sampling step but instead uses the multinomial / Rao-Blackwellised version described by Michael Betancourt in ‘A conceptual introduction to Hamiltonian Monte Carlo’, and, as in the slice-sampling case, relies on an efficient ‘progressive’ sampling implementation which leaves the multinomial distribution over the candidate states invariant while favouring moves closer to the trajectory end-points.

This is indeed a good objection, as our remark was based on an early version of NUTS. However, since we ran an off-the-shelf version of Stan, this should not impact the comparison experiments. Accordingly, we changed the comments in the conclusion.
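
As a toy illustration of the difference (a sketch, not Stan’s actual progressive implementation, which builds the draw recursively while biasing it towards the later doublings): given the vector H of Hamiltonian values along the trajectory, multinomial sampling picks the next state with weights proportional to exp(-H), instead of uniformly over the states retained by a slice.

```r
# Toy illustration (not Stan's progressive/biased scheme): select the next state
# among the trajectory states with multinomial weights proportional to exp(-H_i),
# rather than uniformly over the states kept by a slice variable.
multinomial_pick <- function(H) {
  w <- exp(-(H - min(H)))            # stabilised weights proportional to exp(-H_i)
  sample(seq_along(H), size = 1, prob = w)
}
```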

[later] Computing the summaries using the code you provided on a new run of 10 repeats this morning, I get minimum ESS figures very close to those in your comment:

NUTS: 0.056 +/- 0.0017
eHMCq: 0.096 +/- 0.0034
eHMC: 0.1070 +/- 0.0038
eHMCu: 0.083 +/- 0.0031

Once accounting for the adjusted computational cost calculations I proposed in my code, the ~1.5 to 2 times improvement in these figures is, as you say, concordant with the ~3-4 times improvement in the corresponding values in Figure 2 of your paper.

It’s interesting that the relative ordering of the performance of the eHMC* methods is quite different when comparing on the mean ESS (and similarly for the median / max):

NUTS (mean, median, max ESS): 0.1379 +/- 0.0009, 0.1482 +/- 0.0022, 0.2189 +/- 0.0050
eHMCq (mean, median, max ESS): 0.1348 +/- 0.0025, 0.1334 +/- 0.0029, 0.1837 +/- 0.0054
eHMC (mean, median, max ESS): 0.1727 +/- 0.0023, 0.1697 +/- 0.0024, 0.2664 +/- 0.0097
eHMCu (mean, median, max ESS): 0.1825 +/- 0.0013, 0.1879 +/- 0.0027, 0.2759 +/- 0.0068

From this it seems that the relatively poorer performance of eHMCu is due to lower ESSs in just a few components / dimensions. It seems that this might also explain some of the relatively poorer performance of NUTS as the performance improvements of the eHMC* methods are also a bit lower when comparing on the statistics other than the minimum, and intuitively at least eHMCu seems likely to give the most similar distribution of integration times to NUTS.

This might partly be explained by the use of an identity mass matrix as this means there is no adjustment for different relative scaling along the different dimensions, and so the lower minimum ESS for NUTS / eHMCu may be due to using integration times smaller than required for efficiently exploring the dimension(s) with the largest scale while still exploring the smaller scale dimensions well (as we may end up doing multiple traverses of these dimensions for even a partial traverse of the larger dimensions). It would be interesting to see if / how for example using an adaptively tuned diagonal mass matrix changes things.

Another most valid and relevant objection. We thus reran the experiments with an adapted diagonal mass matrix M, as well as with the identity matrix. The improvement is variable across coordinates and examples, with not much gained by a tuned M over the identity.
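
For the record, a diagonal mass matrix only enters through the momentum refresh, the kinetic energy and the position update of the leapfrog integrator. A minimal sketch (hypothetical names), parameterised by the diagonal inv_m of the inverse mass matrix that Stan adapts, which is roughly the vector of posterior variances:

```r
# Sketch (hypothetical names): leapfrog with a diagonal mass matrix, written in
# terms of inv_m, the diagonal of the inverse mass matrix. Momentum is drawn as
# p ~ N(0, diag(1/inv_m)), the kinetic energy is sum(p^2 * inv_m) / 2, and the
# position update uses M^{-1} p = inv_m * p.
leapfrog_diag <- function(q, p, cached, U_grad, eps, n_leapfrog, inv_m) {
  cur <- cached
  for (l in seq_len(n_leapfrog)) {
    p   <- p - eps / 2 * cur$grad
    q   <- q + eps * inv_m * p               # rescaled position update
    cur <- U_grad(q)                         # joint value-and-gradient evaluation
    p   <- p - eps / 2 * cur$grad
  }
  list(q = q, p = p, cache = cur, n_eval = n_leapfrog)
}
refresh_momentum <- function(inv_m) rnorm(length(inv_m), sd = 1 / sqrt(inv_m))
kinetic_energy   <- function(p, inv_m) sum(p^2 * inv_m) / 2
```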

As a follow-up to my previous comment, I re-ran the same logistic regression experiment with an adaptively tuned diagonal mass matrix in the Stan NUTS runs and used this same diagonal mass matrix in the eHMC runs. Averaging over 10 independent runs gives the following ESS statistics (first table: means; second table: standard errors of the mean):

Min Mean Median Max
0.0595 0.0838 0.0847 0.1097 (NUTS)
0.1226 0.1578 0.1570 0.2068 (eHMCq)
0.1550 0.2135 0.2067 0.3051 (eHMC)
0.1090 0.1653 0.1656 0.2331 (eHMCu)

Min Mean Median Max
0.0031 0.0031 0.0031 0.0032 (NUTS)
0.0020 0.0013 0.0017 0.0057 (eHMCq)
0.0024 0.0019 0.0020 0.0078 (eHMC)
0.0025 0.0023 0.0033 0.0055 (eHMCu)

The range of the per-dimension ESS estimates for the eHMCu runs has narrowed from [0.083, 0.2759] to [0.1090, 0.2331], and similarly for NUTS from [0.056, 0.2189] to [0.0595, 0.1097]. Interestingly, however, the performance improvement is more significant for the eHMC method, which now gives a 2.5-fold gain in efficiency over NUTS.

Thanks again for your help and massive investment in this evaluation!

Bob:

1. The log density is essentially free once you’re computing the gradient [you need to carry it forward to compute all the partials anyway], so the cost per leapfrog step is the same in static HMC and NUTS. Stan does reuse the log density values across leapfrog steps; it then recomputes the log density using plain double values when producing the generated quantities block, but that’s much cheaper than doing the autodiff, so it amounts to a fraction of a gradient evaluation.

Yep, indeed, as mentioned above, this was a naïve perspective on the actual computing requirement! Hopefully it now stands corrected in the text of the paper.

2. The sampling isn’t uniform over the trajectory, but weighted by the log density and biased toward the last doubling. And it does now use Michael Betancourt’s extension to multinomial sampling rather than slice sampling. This also changes how acceptance rates are calculated for warmup. [Aside 1: Andrew prefers warmup because it’s a better electronic analogy—we don’t run for a long time to burn in and see if a part fails, we run to warm it up into its ready-to-go state.] [Aside 2: Michael also updated the warmup stage relative to the original NUTS paper.]

We use off-the-shelf Stan and modified the comments in the conclusion accordingly.

3. The implementation does stop evaluating the trajectory when it hits a divergence. It also stops evaluating when the last doubling has an internal U-turn and it rejects the entire last doubling to maintain reversibility. Stan now outputs the actual number of leapfrog steps evaluated per iteration.
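
[For reference, these per-iteration statistics are exposed by RStan; a minimal sketch, with fit a hypothetical stanfit object, of how the actual NUTS cost can be tallied from them:]

```r
# Sketch (fit is a hypothetical stanfit object): tally the actual number of
# leapfrog steps per post-warmup NUTS iteration, including shortened divergent
# trajectories, as reported by Stan's sampler diagnostics.
library(rstan)
sp <- get_sampler_params(fit, inc_warmup = FALSE)        # one matrix per chain
n_leapfrog <- unlist(lapply(sp, function(m) m[, "n_leapfrog__"]))
divergent  <- unlist(lapply(sp, function(m) m[, "divergent__"]))
total_cost <- sum(n_leapfrog)                            # gradient evaluations spent
```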

He’s also right that rejections need to be saved and counted to maintain detailed balance, but it looks like that’s been addressed. I don’t understand Changye’s response in Xi’an’s reply above.

I’d be more convinced this is something we should add to Stan if three things were done to the evaluation. First, the diagonal metric adaptation should be used to make sure all the computation isn’t dominated by the choice of a poorly matched unit metric. It shouldn’t be hard as the output of adaptation is available after warmup from Stan.

We now use the adapted matrix provided by Stan, with limited changes in the resulting improvement.
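
Concretely, the adapted step size and diagonal metric can be recovered from an RStan fit along the following lines (a sketch only: fit is a hypothetical stanfit object, and the text layout parsed below is the usual but not guaranteed output format of get_adaptation_info()):

```r
# Sketch (fit is a hypothetical stanfit object): recover Stan's adapted step size
# and the diagonal of the inverse mass matrix after warmup. The parsing assumes
# the usual "Step size = ..." / "Diagonal elements of inverse mass matrix:" text
# layout, which may differ across rstan versions.
library(rstan)
info  <- get_adaptation_info(fit)[[1]]                       # adaptation text, chain 1
lines <- strsplit(info, "\n")[[1]]
step_size  <- as.numeric(sub(".*Step size = ", "", lines[grep("Step size", lines)]))
diag_line  <- lines[grep("inverse mass matrix", lines) + 1]  # values on the next line
inv_metric <- as.numeric(strsplit(gsub("[^0-9eE+.,-]", "", diag_line), ",")[[1]])
```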

Second, I’d like to see multiple chains run from diffuse starting points to monitor convergence, and I’d like to use Stan’s multi-chain ESS calculations (I didn’t know where the ess() function being used came from). I think these comparisons only make sense when conditioned on convergence.

As indicated in the paper, the ess() function that we used comes from the mcmcse package of Flegal, Vats et al., and we did run multiple chains.
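
For completeness, the normalisation we have in mind amounts to the following sketch (hypothetical objects: chain is a matrix of post-warmup draws and n_grad the total number of gradient evaluations recorded for the run):

```r
# Sketch (hypothetical objects): per-coordinate effective sample sizes from the
# mcmcse package, divided by the number of gradient evaluations spent, so that
# samplers with different per-iteration costs can be compared.
library(mcmcse)
ess_per_dim  <- ess(chain)                  # chain: (iterations x dimensions) matrix
min_ess_rate <- min(ess_per_dim) / n_grad   # minimum ESS per gradient evaluation
```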

Finally, I’d like to see the models brought up to our best practices. This may be done anyway to make sure convergence monitoring doesn’t fail. For example, the IRT model is coded with a centered parameterization, which we know won’t converge properly in even moderately high dimensions unless there’s a lot of data per group (or, more technically, unless the posterior is tightly constrained, which can also come from tight priors). Also, there isn’t a prior on the coefficients in the Bernoulli-logit example, which is super dangerous given issues of separability, etc., and can also lead to convergence issues. The stochastic volatility model should probably be reparameterized, too; as is, it introduces a lot of posterior correlation. I also don’t understand why the priors (?) are coded as they are.

Well, “best practice” seems to switch from computation to inference. It seems strange to me that we should account for the prior, and correlatively for the parameterisation, in assessing the algorithms. We chose these reparameterisations to avoid constrained spaces. I am also still [very] unclear on how to handle NaNs from a theoretical perspective.

I’m also worried about the low acceptance rate target. Stan’s default is 80% and that’s low for hard problems in that it leads to too many divergences for effective sampling (divergences can cause HMC to revert to a random walk or can cause it to freeze in some location of the posterior if they’re too bad—the right thing to do asymptotically for detailed balance). We want to get to the point where there are zero divergences to ensure we’re not biasing the posterior.

To prevent divergence events induced by a low target acceptance probability, the minimal value was raised to 0.6. Once again, I am perturbed by the constraint on the divergence events.
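
(In RStan the target acceptance statistic is set through the adapt_delta control argument; a minimal sketch, with the model file and data object names hypothetical:)

```r
# Sketch (hypothetical model file and data): raising the target acceptance
# statistic to reduce divergent transitions, and counting the divergences that
# remain on the first chain.
library(rstan)
fit <- stan("logistic.stan", data = stan_data, chains = 4, iter = 2000,
            control = list(adapt_delta = 0.9))             # Stan's default is 0.8
sum(get_sampler_params(fit, inc_warmup = FALSE)[[1]][, "divergent__"])
```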

I’m not sure about all the commented code in things like the SIR model. Given that you do your own transforms without Jacobians, those priors wouldn’t be right as they are inside the comments. But then I see target += statements at the end which seem to include both the Jacobian and some expression for a prior. What’s the motivation for coding things that way?

Why not use minimum effective sample size per gradient eval? I don’t understand the normalization to use just ESS for comparison.

It’d also be nice to vectorize everything, but that’s just an issue of absolute time to make your lives easier (as in these things should run twice as fast or faster if vectorized unless the data sizes are very small). And we have a lot of functions to improve arithmetic stability vs. the handwritten stuff included in the model code.

P.S. Where was the ess() function coming from? Reading R is so confusing because nobody uses namespace qualifiers! Stan’s ESS calculations are much more conservative as they discount for non-convergence. The new ones Aki Vehtari wrote that can deal with anti-correlated draws have been better tested for calibration, so I’d suggest using the latest version (which may only be available in the GitHub version of RStan—we’re having issues getting 2.18 up on CRAN—sorry about that).

P.P.S. It would’ve helped me if all that R wasn’t cut-and-pasted, but rather sourced from one file and reused. It’s impossible to maintain something written this way.

We cannot appropriately address these coding improvements beyond fixing errors, as we have neither the manpower nor the skills to turn our code into a software package, or even into anything maintainable or maintained. The code was made available, along with its shortcomings, as a natural part of writing a computational paper, and we are glad that both Matt and Bob took the time to look at it. We see this code as nothing more than the validation of our experimental comparison (and of Changye’s PhD thesis), not as an end per se. Thanks for your time and support!