(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Sebastian Weber writes:

Bayesian inference has to overcome tough computational challenges and thanks to Stan we now have a scalable MCMC sampler available. For a Stan model running NUTS, the computational cost is dominated by gradient calculations of the model log-density as a function of the parameters. While NUTS is scalable to huge parameter spaces, this scalability becomes more of a theoretical one as the computational cost explodes. Models which involve ordinary differential equations (ODE) are such an example, where the runtimes can be of the order of days.

The obvious speedup when using Stan is to run multiple chains at the same time on different computer cores. However, this cannot reduce the total runtime per chain, which requires within-chain parallelization.

Hence, a viable approach is to parallelize the gradient calculation within a chain. As many Bayesian models facilitate hierarchical models over groupings we can often calculate contributions to the log-likelihood separately for each of these groups.

Therefore, the concept of an embarrassingly parallel program can be applied in this setting, i.e. one can calculate these independent work chunks on separate CPU cores and then collect the results.

For reasons implied by Stan’s internals (the gradient calculation must not run in a threaded program) we are restricted in applicable techniques. One possibility is the Message Passing Interface (MPI) which spawns multiple CPU cores by firing off independent processes. A root process will send packets of work (sets of parameters) to the child nodes which do the work and then send back the results (function return values and the gradients). A first toy example shows dramatic speedups (3 ODEs, 7 parameters). That is, when going from 1 core runtime of 5.2h we can crank it down to just 17 minutes by using 20 cores (18x speedup) on a single machine with 20 cores. MPI scales also across machines and when throwing 40 cores at the problem we are down to 10 minutes which is “only” a 31x speedup (see the above plot).

Of course, the MPI approach works best on clusters with many CPU

cores. Overall, this is fantastic news for big models as this opens the door to scale out large problems onto clusters which are available nowadays in many research facilities.The source code for this prototype is on our github repository. This code should be regarded as working research code and we are currently working on bringing this feature into the main Stan distribution.

Wow. This is a big deal. There are lots of problems where this method will be useful.

**P.S.** What’s with the weird y-axis labels on that graph? I think it would work better to just go 1, 2, 4, 8, 16, 32 on both axes. I like the wall-time markings on the line, though; that helped me follow what was going on.

The post SPEED: Parallelizing Stan using the Message Passing Interface (MPI) appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**