(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Rachael Meager writes:

We’re working on a policy analysis project. Last year we spoke about individual treatment effects, which is the direction we want to go in. At the time you suggested BART [Bayesian additive regression trees; these are

notaverages of tree models as are usually set up; rather, the key is that many little nonlinear tree models are being summed; in that sense, Bart is more like a nonparametric discrete version of a spline model. —AG].But there are 2 drawbacks of using BART for this project. (1) BART predicts the outcome not the individual treatment effect – although those are obviously related and there has been some discussion of this in the econ literature. (2) It will be hard for us to back out the covariate combinations / interactions that predict the outcomes / treatment effects strongly. We can back out the important individual predictors using the frequency of appearance in the branches, but BART (and Random Forests) don’t have the easy interpretation that Trees give.

Obviously it should be possible to fit Bayesian Trees if one can fit BART. So my questions to you are:

1. Is it kosher to fit BART and also fit a Tree separately? Is there a better way?

2. Our data has a hierarchical structure (villages, implementers, countries) and it looks like trees/BART don’t have any way to declare that structure. Do you know of a way to incorporate it? Any advice/cautions here?

My reply:

– I don’t understand this statement: “BART predicts the outcome not the individual treatment effect.” Bart does predict the outcome, but the individual treatment effect is just the outcome with treatment=1, minus the outcome with treatment=0. So you get this directly. At least, that’s what I took as the message of Jennifer Hill’s 2011 paper. So I don’t see why anything new needs to be invoked here.

– Your second point is that a complicated fitted model is hard to understand: “It will be hard for us to back out the covariate combinations / interactions that predict the outcomes / treatment effects strongly.” I think you should do this using average predictive comparisons as in my paper with Pardoe. In that paper, we work with linear regressions and glms, but the exact same principle would work with Bart, I think. This might be of general interest so maybe it’s worth writing a paper on it.

– I would strongly *not* recommend “backing out the important individual predictors using the frequency of appearance in the branches.” The whole point of Bart, as I understand it, is that it is a continuous predictive model; it’s just using trees as a way to construct the nonparametric fit. In that way, Bart is like a spline: The particular functional form is a means to an end, just as in splines where what we care about is the final fitted curve, not the particular pieces used to put it together.

– I disagree that trees have an easy interpretation. I mean, sure, they seem easy to interpret, but in general they make so sense, so the apparent easy interpretation is just misleading.

– Jennifer and I have been talking about adding hierarchical structure to Bart. She might have already done it, in fact! Jennifer’s been involved in the development of a new R package that does Bart much faster and, I think, more generally, than the previously existing implementation.

In short, I suspect you can do everything you need to do with Bart already. But the multilevel modeling, there I’m not sure. One approach would be to switch to a nonparametric Bayesian model using Gaussian processes. This could be a good solution but it probably does not make so much sense here, given your existing investment in Bart. Instead I suggest an intermediate approach where you fit the model in Bart and then you fit a hierarchical linear model to the residuals to suck up some multilevel structure there.

GP, like Bart, can be autotuned. To some extent this is still a research project, but we’ve been making a lot of progress on this recently. So I don’t think this tuning issue is an inherent problem with GP’s; rather, it’s more of a problem with our current state of knowledge, but I think it’s a problem that we’re resolving.

When Jennifer says she doesn’t trust the estimate of the individual treatment effect, I think she’s saying that (a) such an estimate will have a large standard error, and (b) it will be highly model dependent. Inference for an average treatment effect can be stable, even if inferences for individual treatment effects are not.

I really don’t like the idea of counting the number of times a variable is in a tree, as a measure of importance. There are lots of problems here, most obviously that counting doesn’t give any sense of the magnitude of the prediction. More fundamentally, all variables go into a prediction, and the fact that a variable is included in one tree and not another . . . that isn’t really relevant. Again, it would be like trying to understand a spline by looking at individual components; the only purpose of the individual components is to combine to make that total prediction.

Why do trees make no sense? It depends on context. In social science, there are occasional hard bounds (for example, attitudes on health care in the U.S. could change pretty sharply around age 65) but in general we don’t expect to see such things. It makes sense for the underlying relationships to be more smooth and continuous, except in special cases where there happen to be real-life discontinuities (and in those cases we’d probably include the discontinunity directly in our model, for example by including an “age greater than 65” indicator). Again, Bart uses trees under the hood in the same way that splines use basis functions: as a tool for producing a smooth prediction surface.

**P.S.** More from Jennifer here.

The post Causal inference using Bayesian additive regression trees: some questions and answers appeared first on Statistical Modeling, Causal Inference, and Social Science.

**Please comment on the article here:** **Statistical Modeling, Causal Inference, and Social Science**