Scaling of MCMC with dimension (experiments)

May 15, 2018

[Figure: square, cube and tesseract, taken from Wikipedia's article on dimension.]

Hi all,

In this post, I'll go through numerical experiments illustrating the scaling of some MCMC algorithms with respect to the dimension. I will focus on a simple setting, to illustrate some theoretical results developed by Gareth Roberts, Andrew Stuart, Alexandros Beskos, Jeff Rosenthal and many of their co-authors over the years: for instance here for random walk Metropolis-Hastings (RWMH; see also here for more recent work), here for the Metropolis-adjusted Langevin algorithm (MALA), and here for Hamiltonian Monte Carlo (HMC).

The target distribution \pi will be a standard multivariate Normal distribution in dimension d, with zero mean and identity covariance matrix: \pi is \mathcal{N}(0,I). All chains will be started at stationarity: the first state of each chain is sampled from the target distribution, so there is no burn-in or warm-up phase.

An MCMC algorithm generates a Markov chain (X_t)_{t=1}^T which, after T iterations, can be used to estimate integrals with respect to \pi. We will focus on the task of estimating the second moment of the first marginal of \pi, so we look at the average V_T = T^{-1} \sum_{t=1}^T X_t(1)^2, which converges to 1 as T \to \infty. Here X_t(1) denotes the first component of the t-th iterate of the Markov chain, which lives in a d-dimensional space. Focusing on a fixed-dimensional marginal of the target distribution enables comparisons across dimensions.

After T iterations, the MCMC average V_T won't be exactly equal to one. Its expectation is exactly one, because the chain starts at stationarity, but there is some variability due to the random nature of these algorithms. Below I'll plot mean squared errors, defined as the expected squared difference between the estimator V_T and the estimand 1, i.e. \mathcal{E}_T = \mathbb{E}[(V_T-1)^2]. This expectation cannot be computed analytically, but we can generate independent copies of V_T and approximate \mathcal{E}_T with a sample average over the copies (below I'll generate 500 copies).
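For concreteness, here is a minimal sketch of this Monte Carlo estimate of \mathcal{E}_T, using i.i.d. draws from the target as a stand-in for an actual MCMC run (a hypothetical "perfect" sampler, just to illustrate the estimator; the values of d, T and the number of copies below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

d, T, n_copies = 10, 1000, 500            # arbitrary illustrative values
errors = np.empty(n_copies)
for i in range(n_copies):
    # stand-in for an MCMC run: i.i.d. draws from N(0, I_d)
    chain = rng.standard_normal((T, d))
    V_T = np.mean(chain[:, 0] ** 2)       # average of squared first components
    errors[i] = (V_T - 1.0) ** 2          # squared error of this copy

print(errors.mean())                      # Monte Carlo estimate of E[(V_T - 1)^2]
```

For i.i.d. draws the variance of X_t(1)^2 is 2, so the printed value should be close to 2/T; the MCMC estimators below incur larger errors because of the autocorrelation of the chains.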

This setup can be used to see how algorithms behave in increasing dimensions. A complication is that most MCMC algorithms have tuning parameters. How do we tune the algorithms in such a way that they achieve stable performance as the dimension increases? If we are not careful with the choice of tuning parameters, the algorithms would eventually stop working as the dimension increases. This is where the diffusion limit results are helpful: they provide choices of tuning parameters that make the algorithms work in all dimensions.

Consider first a Metropolis-Hastings algorithm with Gaussian random walk proposal. At each iteration, given X_t, the next state is generated as follows:

  • draw X^\star \sim \mathcal{N}(X_t, h^2 I)
  • draw U \sim \mathcal{U}(0,1)
  • if \log U < (\log \pi(X^\star) - \log \pi(X_t)), set X_{t+1} = X^\star (this is an acceptance), otherwise set X_{t+1} = X_t (this is a rejection).

This paper shows that if we choose h = C d^{-1/2}, where C is a constant and d is the dimension, then the acceptance rate of the algorithm converges to a constant which is neither 0 nor 1 as the dimension increases. I'll use C=1 below, which is not optimal. With this scaling of h, we take a number of iterations T that grows linearly with d, e.g. T = 1000d. Then the mean squared error \mathcal{E}_T should be stable in the dimension. Does that work? The plot below shows the acceptance rate and the mean squared error being stable with the dimension.
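Here is a minimal sketch of that recipe (not the exact code behind the plots): a RWMH chain targeting \mathcal{N}(0,I), with proposal standard deviation h = d^{-1/2} and T = 1000d iterations, started at stationarity, returning V_T and the acceptance rate. The dimensions in the final loop are just illustrative.

```python
import numpy as np

def rwmh(d, rng, C=1.0, iters_per_dim=1000):
    """RWMH targeting N(0, I_d): proposal std h = C * d**(-1/2),
    T = iters_per_dim * d iterations, started at stationarity.
    Returns (V_T, acceptance rate)."""
    h = C / np.sqrt(d)
    T = iters_per_dim * d
    x = rng.standard_normal(d)                 # initial state drawn from the target
    logpi_x = -0.5 * np.sum(x ** 2)            # log target density, up to a constant
    first_sq = np.empty(T)
    n_accept = 0
    for t in range(T):
        prop = x + h * rng.standard_normal(d)  # Gaussian random walk proposal
        logpi_prop = -0.5 * np.sum(prop ** 2)
        if np.log(rng.uniform()) < logpi_prop - logpi_x:
            x, logpi_x = prop, logpi_prop      # accept
            n_accept += 1
        first_sq[t] = x[0] ** 2
    return first_sq.mean(), n_accept / T

rng = np.random.default_rng(2018)
for d in [1, 5, 10, 20]:
    V_T, acc = rwmh(d, rng)
    print(d, round(acc, 2), round(V_T, 3))
```

Wrapping such a run in the 500-copy loop sketched earlier gives the Monte Carlo approximation of \mathcal{E}_T plotted below.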

[Figure: acceptance rate and mean squared error of RWMH against the dimension.]

Overall, since each iteration has a cost of order d (evaluating x\mapsto \log \pi(x) costs of the order of d operations), the total complexity of the algorithm can be said to be of the order of d^2.

Now what about MALA? According to this paper, if we take the stepsize to be C d^{-1/6}, the acceptance rate will be stable; if we then take a number of steps T of the order of d^{1/3}, the mean squared error should be stable. Below I've used C = 1 and T = 1000\times (1 + \lfloor d^{1/3} \rfloor). The overall complexity is thus of order d^{4/3}.
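In the same spirit, here is a sketch of a MALA kernel for this target (again with C = 1; for the standard Normal the gradient of \log \pi is simply x \mapsto -x, and the acceptance ratio involves the forward and backward Gaussian proposal densities):

```python
import numpy as np

def mala(d, rng, C=1.0):
    """MALA targeting N(0, I_d): stepsize sigma = C * d**(-1/6),
    T = 1000 * (1 + floor(d**(1/3))) iterations, started at stationarity.
    Returns (V_T, acceptance rate)."""
    sigma = C * d ** (-1.0 / 6.0)
    T = 1000 * (1 + int(d ** (1.0 / 3.0)))
    x = rng.standard_normal(d)                   # initial state drawn from the target
    first_sq = np.empty(T)
    n_accept = 0
    for t in range(T):
        # Langevin proposal: drift along the gradient of log pi, which is -x here
        mean_fwd = x - 0.5 * sigma ** 2 * x
        prop = mean_fwd + sigma * rng.standard_normal(d)
        mean_bwd = prop - 0.5 * sigma ** 2 * prop
        # log MH ratio: target ratio times ratio of Gaussian proposal densities
        log_ratio = (-0.5 * np.sum(prop ** 2) + 0.5 * np.sum(x ** 2)
                     - np.sum((x - mean_bwd) ** 2) / (2 * sigma ** 2)
                     + np.sum((prop - mean_fwd) ** 2) / (2 * sigma ** 2))
        if np.log(rng.uniform()) < log_ratio:
            x = prop                             # accept
            n_accept += 1
        first_sq[t] = x[0] ** 2
    return first_sq.mean(), n_accept / T
```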

[Figure: acceptance rate and mean squared error of MALA against the dimension.]

For HMC, following this paper, we can fix the integration time of each numerical solution of Hamilton's equations to be constant. If we pick a stepsize in the leapfrog integrator of the order of d^{-1/4}, incurring of the order of d^{1/4} leapfrog steps to reach a constant integration time, then the acceptance rate will be stable. Since the integration time is fixed, we can use a fixed number of MCMC steps, say T = 1000, in all dimensions. The overall complexity is of order d^{5/4}: each iteration involves of the order of d^{1/4} leapfrog steps, each of which costs of the order of d operations. Does it work? Below I took the stepsize \varepsilon to be d^{-1/4}, and the number of leapfrog steps to be 1 + \lfloor \varepsilon^{-1} \rfloor.
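Here is a sketch of such an HMC kernel, with the leapfrog integrator written out; for this target the potential is U(x) = \|x\|^2/2, so \nabla U(x) = x.

```python
import numpy as np

def hmc(d, rng, T=1000):
    """HMC targeting N(0, I_d): leapfrog stepsize eps = d**(-1/4),
    L = 1 + floor(1/eps) leapfrog steps, T iterations, started at stationarity.
    Returns (V_T, acceptance rate)."""
    eps = d ** (-0.25)
    L = 1 + int(1.0 / eps)
    x = rng.standard_normal(d)                    # initial state drawn from the target
    first_sq = np.empty(T)
    n_accept = 0
    for t in range(T):
        p = rng.standard_normal(d)                # fresh momentum
        x_new, p_new = x.copy(), p.copy()
        # leapfrog integration of Hamilton's equations, with grad U(x) = x
        p_new = p_new - 0.5 * eps * x_new         # initial half step (momentum)
        for step in range(L):
            x_new = x_new + eps * p_new           # full step (position)
            if step < L - 1:
                p_new = p_new - eps * x_new       # full step (momentum)
        p_new = p_new - 0.5 * eps * x_new         # final half step (momentum)
        # log acceptance ratio: difference of Hamiltonians H(x, p) = U(x) + ||p||^2 / 2
        log_ratio = (0.5 * np.sum(x ** 2) + 0.5 * np.sum(p ** 2)
                     - 0.5 * np.sum(x_new ** 2) - 0.5 * np.sum(p_new ** 2))
        if np.log(rng.uniform()) < log_ratio:
            x = x_new                             # accept
            n_accept += 1
        first_sq[t] = x[0] ** 2
    return first_sq.mean(), n_accept / T
```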

[Figure: acceptance rate and mean squared error of HMC against the dimension.]

The plot for the MSE of HMC is a bit rugged because the number of leapfrog steps varies discontinuously (it needs to increase with the dimension, but it also needs to be an integer). Note that the plots of MSE above are not comparable across samplers, because they are not adjusted for computing time and because the constants are not optimally tuned. The aim is simply to illustrate that the scalings obtained in the literature can indeed be checked with numerical experiments.

So on this i.i.d. Gaussian target, HMC scales better than MALA, which scales better than RWMH. The literature contains various generalizations to non-Gaussian and non-product distributions, i.e. distributions that do not factorize as a product of marginals, see e.g. here. Qualitatively, all these algorithms deteriorate with the dimension, in a polynomial fashion, and the difference is in the exponent.

What about Gibbs samplers? It's hard to talk about the Gibbs sampler in very generic terms, as it is mostly used in specific forms tailored to the application of interest. And it can indeed be very useful. If we were to perform RWMH steps to update each component of the chain given the others, in a systematic scan fashion, then we could use a fixed step size for each component. Thus we would not need to scale the number of iterations (i.e. full sweeps) with the dimension. However, each iteration would involve a sweep over all d components, and furthermore, each target evaluation would, in general, involve d operations (if we don't know how to evaluate the conditional target density, we have to evaluate the joint target density). This would result in a cost of the order of d^2 operations, which is the same as RWMH above. However, if we knew how to compute the conditional target densities in the order of one operation, then the scaling of this Gibbs sampler would be linear in d, which beats all the samplers mentioned above. Specific, case-by-case information about the conditional distributions thus seems very important for a successful implementation of Gibbs samplers, and the rewards are high in high-dimensional settings. Beyond the toy example considered here, see Scalable MCMC for Bayes Shrinkage Priors, by Johndrow, Orenstein and Bhattacharya, for a recent application to a high-dimensional regression model.
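To illustrate that cost accounting, here is a sketch of one systematic-scan sweep of single-site RWMH updates for this target, with a fixed, dimension-independent step size h. For \mathcal{N}(0,I) the conditional log-density of a coordinate given the others is just -x_i^2/2, so each of the d updates costs O(1) operations; with a generic joint density each update would cost O(d), hence the d^2 total mentioned above.

```python
import numpy as np

def single_site_rwmh_sweep(x, rng, h=1.0):
    """One systematic scan over the coordinates of x, targeting N(0, I_d):
    each coordinate gets a random walk MH update with fixed step size h.
    The conditional log-density of coordinate i is -0.5 * x[i]**2 here,
    so each of the d updates costs O(1) operations."""
    for i in range(x.shape[0]):
        prop = x[i] + h * rng.standard_normal()
        log_ratio = -0.5 * prop ** 2 + 0.5 * x[i] ** 2
        if np.log(rng.uniform()) < log_ratio:
            x[i] = prop
    return x
```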

Possible future blog posts might cover the scaling of methods based on importance sampling, the scaling of software package implementations of MCMC algorithms, and the scaling of couplings of MCMC algorithms.


