November 16, 2012

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Richard McElreath writes:

I’ve been translating a few ongoing data analysis projects into Stan code, mostly with success. The most important for me right now has been a hierarchical zero-inflated gamma problem. This a “hurdle” model, in which a bernoulli GLM produces zeros/nonzeros, and then a gamma GLM produces the nonzero values, using varying effects correlated with those in the bernoulli process.

The data are 20 years of human foraging returns from a subsistence hunting population in Paraguay (the Ache), comprising about 15k hunts in total (Hill & Kintigh. 2009. Current Anthropology 50:369-377). Observed values are kilograms of meat returned to camp. The more complex models contain a 147-by-9 matrix of varying effects (147 unique hunters), as well as imputation of missing values.

Originally, I had written the sampler myself in raw R code. It was very slow, but I knew what it was doing at least. Just before Stan version 1.0 was released, I had managed to get JAGS to do it all quite reliably. But JAGS was taking a long time to converge and then producing highly autocorrelated output. Stan has been amazing, in comparison. I could hardly believe the traceplots. Stan produces the same inferences as my JAGS code does, but with 8 hour runs (no thinning needed) instead of 30 hour runs (with massive thinning).

In the future, I should be getting similar data for about a dozen other foraging populations, so will want to scale this up to a meta-analytic level, with partial pooling across societies. So the improved efficiency from Stan will be a huge help going forward as well.

On the horizon, I have a harder project I’d like to port into Stan, involving cumulative multi-normal likelihoods. I wrote my own sampler, using likelihoods from pmvnorm in the mvtnorm package, but it mixes very slowly, once all the varying effects are included. Is there a clever way to get the same likelihoods in Stan yet? If not, once you have a guide prepared for how to compile in new distributions, I can probably use that to hack mvtnorm’s pmvnorm into Stan.

I’m pretty sure that with some vectorization and other steps, he can get his model to run in much less than 8 hours in Stan. But I’m happy to see that even an inefficient implementation is working well.

And Lucas Leeman writes:

I just wanted to say thank you for Stan! Thank you and your collaborators very much!

I had this problem with a very slow mixing chain and I have finally managed to get Stan to do what I want. With the mock example I am playing Stan drastically outperforms the software I was using before.

In announcing this progress, I am not trying in any way to disparage Bugs or Jags. The success of these earlier programs is what inspired us to develop Stan. A few years ago, I had the attitude that I could fit a model in Bugs, and if that didn’t work I could program it myself. Now there’s Stan. Fitting a model in Stan is essentially the same as programming it myself, except that the program has already been optimized and debugged, thus combining the convenience of Bugs with the efficiency of compiled code.

Also, again we thank the Department of Energy, Institute for Education Sciences, and National Science Foundation for partial support of this project.

Please comment on the article here: Statistical Modeling, Causal Inference, and Social Science

Tags: ,