The punch line
“Your readers are my target audience. I really want to convince them that it makes sense to divide regression coefficients by 2 and their standard errors by sqrt(2). Of course, additional prior information should be used whenever available.”
It started with an email from Erik van Zwet, who wrote:
In 2013, you wrote about the hidden dangers of non-informative priors:
Finally, the simplest example yet, and my new favorite: we assign a non-informative prior to a continuous parameter theta. We now observe data, y ~ N(theta, 1), and the observation is y=1. This is of course completely consistent with being pure noise, but the posterior probability is 0.84 that theta>0. I don’t believe that 0.84. I think (in general) that it is too high.
I agree – at least if theta is a regression coefficient (other than the intercept) in the context of the life sciences.
In this paper [which has since been published in a journal], I propose that a suitable default prior is the normal distribution with mean zero and standard deviation equal to the standard error SE of the unbiased estimator. The posterior is the normal distribution with mean y/2 and standard deviation SE/sqrt(2). So that’s a default Edlin factor of 1/2. I base my proposal on two very different arguments:
1. The uniform (flat) prior is considered by many to be non-informative because of certain invariance properties. However, I argue that those properties break down when we reparameterize in terms of the sign and the magnitude of theta. Now, in my experience, the primary goal of most regression analyses is to study the direction of some association. That is, we are interested primarily in the sign of theta. Under the prior I’m proposing, P(theta > 0 | y) has the standard uniform distribution (Theorem 1 in the paper). In that sense, the prior could be considered to be non-informative for inference about the sign of theta.
2. The fact that we are considering a regression coefficient (other than the intercept) in the context of the life sciences is actually prior information. Now, almost all research in the life sciences is listed in the MEDLINE (PubMed) database. In the absence of any additional prior information, we can consider papers in MEDLINE that have regression coefficients to be exchangeable. I used a sample of 50 MEDLINE papers to estimate the prior and found the normal distribution with mean zero and standard deviation 1.28*SE. The data and my analysis are available here.
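The uniformity claim in argument 1 can be checked by simulation. The following is a minimal sketch, not taken from the paper: it assumes theta really is drawn from N(0, SE^2) and y from N(theta, SE^2), and the function names, seed, and number of draws are illustrative choices.

```python
import random
from statistics import NormalDist

def posterior_prob_positive(y, se):
    # Prior N(0, se^2) combined with y ~ N(theta, se^2) gives the
    # posterior N(y/2, se^2/2), so P(theta > 0 | y) = Phi((y/2) / (se/sqrt(2))).
    return NormalDist().cdf((y / 2) / (se / 2 ** 0.5))

def simulate_probs(n_draws=20000, se=1.0, seed=1):
    # Draw theta from the proposed prior, then y given theta,
    # and record the posterior probability that theta > 0.
    rng = random.Random(seed)
    probs = []
    for _ in range(n_draws):
        theta = rng.gauss(0.0, se)
        y = rng.gauss(theta, se)
        probs.append(posterior_prob_positive(y, se))
    return probs

def decile_fractions(probs):
    # Fraction of simulated probabilities in each tenth of [0, 1].
    counts = [0] * 10
    for p in probs:
        counts[min(int(p * 10), 9)] += 1
    return [c / len(probs) for c in counts]

fractions = decile_fractions(simulate_probs())
print([round(f, 3) for f in fractions])
```

If the claim holds, each of the ten fractions should be close to 0.1, up to simulation noise.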
The two arguments are very different, so it’s nice that they yield fairly similar results. Since published effects tend to be inflated, I think the 1.28 is somewhat overestimated. So, I end up recommending the N(0,SE^2) as default prior.
I think it makes sense to divide regression coefficients by 2 and their standard errors by sqrt(2). Of course, additional prior information should be used whenever available.
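The arithmetic behind "divide by 2 and by sqrt(2)" is the standard conjugate normal update. Here is a small sketch of it, assuming a zero-mean normal prior and a normal likelihood with known standard error. The function names are illustrative, and the flat prior is handled as the limiting case where the posterior is N(y, SE^2).

```python
from statistics import NormalDist

def edlin_factor(prior_sd, se):
    # Multiplier applied to the raw estimate under a N(0, prior_sd^2) prior.
    return prior_sd ** 2 / (prior_sd ** 2 + se ** 2)

def normal_posterior(y, se, prior_sd):
    # Posterior mean and sd of theta given y ~ N(theta, se^2)
    # and theta ~ N(0, prior_sd^2).
    shrink = edlin_factor(prior_sd, se)
    return shrink * y, se * shrink ** 0.5

def prob_positive(mean, sd):
    # P(theta > 0) when theta ~ N(mean, sd^2).
    return 1.0 - NormalDist(mean, sd).cdf(0.0)

y, se = 1.0, 1.0

# Flat prior: posterior is N(y, se^2); this gives about 0.84,
# the figure quoted from the 2013 post.
print(prob_positive(y, se))

# Proposed default prior N(0, se^2): posterior mean y/2, sd se/sqrt(2).
mean, sd = normal_posterior(y, se, prior_sd=se)
print(mean, sd, prob_positive(mean, sd))

# Prior estimated from the MEDLINE sample, sd = 1.28 * se.
print(edlin_factor(1.28 * se, se))
```

The factor for the 1.28*SE prior comes out somewhat above 1/2, which is consistent with the email's remark that the two arguments give fairly similar results.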
My reply:
Hmmm . . . one way to think about this idea is to consider where it doesn’t make sense. You write, “a suitable default prior is the normal distribution with mean zero and standard deviation equal to the standard error SE of the unbiased estimator.” Let’s consider two cases where this default won’t work:
– The task is to estimate someone’s weight with one measurement on a scale where the measurements have standard deviation 1 pound, and you observe 150 pounds. You’re not going to want to partially pool that all the way to 75 pounds. The point here, I suppose, is that the goal of the measurement is not to estimate the sign of the effect. But we could do the same reasoning where the goal was to estimate the sign. For example, I weigh you, then I weigh you again a year later. I’m interested in seeing if you gained or lost weight. The measurement was 150 pounds last year and 140 pounds this year. The classical estimate of the difference of the two measurements is 10 +/- 1.4. Would I want to partially pool that all the way to 5? Maybe, in that these are just single measurements and your weight can fluctuate. But that can’t be the motivation here, because we could just as well take 100 measurements at one time and 100 measurements a year later, so now maybe your average is, say, 153 pounds last year and 143 pounds this year: an estimated change of 10 +/- 0.14. We certainly wouldn’t want to use a super-precise prior with mean 0 and sd 0.14 here!
– The famous beauty-and-sex-ratio study where the difference in probability of girl birth, comparing children of beautiful and non-beautiful parents, was estimated from some data to be 8 percentage points +/- 3 percentage points. In this case, an Edlin factor of 0.5 is not enough. Pooling down to 4 percentage points is not enough pooling. A better estimate of the difference would be 0 percentage points, or 0.01 percentage points, or something like that.
I guess what I’m getting at is that the balance between prior and data changes as we get more information, so I don’t see how a fixed amount of partial pooling can work.
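To make the point about changing balance concrete, here is a rough sketch using the weight example. The measurement sd of 1 pound comes from the example above; the fixed prior sd of 10 pounds for a one-year weight change is a purely hypothetical number chosen for illustration, and the function names are made up.

```python
def se_of_change(n, measurement_sd=1.0):
    # Standard error of the difference between two averages,
    # each based on n independent measurements.
    return (2 * measurement_sd ** 2 / n) ** 0.5

def shrinkage_factor(prior_sd, se):
    # Multiplier on the raw estimate under a N(0, prior_sd^2) prior.
    return prior_sd ** 2 / (prior_sd ** 2 + se ** 2)

FIXED_PRIOR_SD = 10.0  # hypothetical prior sd for weight change, in pounds

for n in (1, 100):
    se = se_of_change(n)
    fixed = shrinkage_factor(FIXED_PRIOR_SD, se)  # prior stays put as data accumulate
    scaled = shrinkage_factor(se, se)             # prior sd tied to the SE
    print(f"n={n}: se={se:.2f}, fixed-prior factor={fixed:.4f}, se-scaled factor={scaled:.2f}")
```

With a prior that does not depend on the sample size, the factor moves toward 1 as the standard error shrinks; with the prior sd tied to the standard error, it stays at 1/2 no matter how much data come in.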
That said, maybe I’m missing something here. After all, a default can never cover all cases, and the current default of no partial pooling or flat prior has all sorts of problems. So we can think more about this.
P.S. In the months since I wrote the above post, Zwet sent along further thoughts:
Since I emailed you in the fall, I’ve continued thinking about default priors. I have a clearer idea now about what I’m trying to do:
In principle, one can obtain prior information for almost any research question in the life sciences via a meta-analysis. In practice, however, there are (at least) three obstacles. First, a meta-analysis is extra work and that is never popular. Second, the literature is not always reliable because of publication bias and such. Third, it is generally unclear what the scope of the meta-analysis should be.
Now, researchers often want to be “objective” or “non-informative”. I believe this can be accomplished by performing a meta-analysis with a very wide scope. One might think that this would lead to very diffuse priors, but that turns out not to be the case! Using a very wide scope to obtain prior information also means that the same meta-analysis can be recycled in many situations.
The problem of publication bias in the literature remains, but there may be ways to handle that. In the paper I sent earlier, I used p-values from univariable regressions that were used to “screen” variables for a multivariable model. I figure that those p-values should be largely unaffected by selection on significance, simply because that selection is still to be done!
More recently, I’ve used a set of “honest” p-values that were generated by the Open Science Collaboration in their big replication project in psychology (Science, 2015). I’ve estimated a prior and then computed type S and M errors. I attach the results together with the (publicly available) data. The results are also here.
It’s an appealing idea, and in practice it should be better than the current default Edlin factor of 1 (that is, no partial pooling toward zero at all). And I’ve talked a lot about constructing default priors based on empirical information, so it’s great to see someone actually doing it. Still, I have some reservations about the specific recommendations, for the reasons expressed in my response to Zwet above. Like him, I’m curious about your thoughts on this.
I also wrote something on this in our Prior Choice Recommendations wiki:
Default prior for treatment effects scaled based on the standard error of the estimate
Erik van Zwet suggests an Edlin factor of 1/2. Assuming that the existing or published estimate is unbiased with known standard error, this corresponds to a default prior that is normal with mean 0 and sd equal to the standard error of the data estimate. This can’t be right–for any given experiment, as you add data, the standard error should decline, so this would suggest that the prior depends on sample size. (On the other hand, the prior can often only be understood in the context of the likelihood; http://www.stat.columbia.edu/~gelman/research/published/entropy-19-00555-v2.pdf, so we can’t rule out an improper or data-dependent prior out of hand.)
Anyway, the discussion with Zwet got me thinking. If I see an estimate that’s 1 se from 0, I tend not to take it seriously; I partially pool it toward 0. So if the data estimate is 1 se from 0, then, sure, the normal(0, se) prior seems reasonable as it pools the estimate halfway to 0. But if the data estimate is, say, 4 se’s from zero, I wouldn’t want to pool it halfway: at this point, zero is not so relevant. This suggests something like a t prior. Again, though, the big idea here is to scale the prior based on the standard error of the estimate.
Another way of looking at this prior is as a formalization of what we do when we see estimates of treatment effects. If the estimate is only 1 standard error away from zero, we don’t take it too seriously: sure, we take it as some evidence of a positive effect, but far from conclusive evidence–we partially pool it toward zero. If the estimate is 2 standard errors away from zero, we still think the estimate has a bit of luck to it–just think of the way in which researchers, when their estimate is 2 se’s from zero, (a) get excited and (b) want to stop the experiment right there so as not to lose the magic–hence some partial pooling toward zero is still in order. And if the estimate is 4 se’s from zero, we just tend to take it as is.
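Here is one way to see how a heavy-tailed prior produces that pattern. This is only a sketch: the wiki text says "something like a t prior" without specifying it, so the 3 degrees of freedom and the choice of prior scale equal to the standard error are assumptions made for illustration. The posterior mean is computed by brute-force summation over a grid.

```python
import math

def posterior_mean_t_prior(y, se=1.0, df=3.0, n_grid=4001):
    # Posterior mean of theta for y ~ N(theta, se^2) under a zero-centered
    # Student-t prior with scale se. df=3 is a hypothetical choice.
    lo, hi = y - 12 * se, y + 12 * se  # likelihood is negligible outside this range
    step = (hi - lo) / (n_grid - 1)
    num = den = 0.0
    for i in range(n_grid):
        theta = lo + i * step
        # Unnormalized t density; the normalizing constant cancels in the ratio.
        prior = (1 + (theta / se) ** 2 / df) ** (-(df + 1) / 2)
        lik = math.exp(-0.5 * ((y - theta) / se) ** 2)
        num += theta * prior * lik
        den += prior * lik
    return num / den

for z in (1.0, 2.0, 4.0):
    factor = posterior_mean_t_prior(z) / z
    print(f"estimate {z:.0f} se from zero: multiply by {factor:.2f}")
```

Under these assumptions the multiplier grows with the distance from zero, unlike the normal(0, se) prior, which gives 1/2 everywhere. Lowering df makes the tails heavier and the large estimates are pooled even less.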
I sent some of the above to Zwet, who replied:
I [Zwet] proposed that default Edlin factor of 1/2 only when the estimate is less than 3 se’s away from zero (or rather, p<0.001). I used a mixture of two zero-mean normals; one with sd=0.68 and the other with sd=3.94. I’m quite happy with the fit. The shrinkage is a little more than 1/2 when the estimate is close to zero, and disappears gradually for larger estimates. It’s in the data! You can see it when you do a “wide scope” meta-analysis.
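The shrinkage implied by a two-component normal mixture prior can be sketched as follows. The sds 0.68 and 3.94 are from the reply above; I am assuming they are in units of the standard error (that is, on the z-score scale). The reply does not give the mixture weights, so the 0.5 below is a hypothetical placeholder, and the function names are illustrative.

```python
import math

NARROW_SD = 0.68     # from the reply, assumed to be in units of the standard error
WIDE_SD = 3.94       # from the reply, same assumption
NARROW_WEIGHT = 0.5  # hypothetical: the mixture weight is not stated in the reply

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mixture_shrinkage_factor(z, weight=NARROW_WEIGHT):
    # Multiplier applied to an estimate that is z standard errors from zero.
    components = [(weight, NARROW_SD), (1 - weight, WIDE_SD)]
    # Marginally, z ~ N(0, sd^2 + 1) under each component; this updates
    # the probability that the effect came from that component.
    evidence = [w * normal_pdf(z, math.sqrt(sd ** 2 + 1)) for w, sd in components]
    total = sum(evidence)
    # Average the per-component normal shrinkage factors with the updated weights.
    return sum(e / total * sd ** 2 / (sd ** 2 + 1)
               for e, (_, sd) in zip(evidence, components))

for z in (0.5, 1.0, 2.0, 3.0, 4.0, 5.0):
    print(f"z={z}: multiply estimate by {mixture_shrinkage_factor(z):.2f}")
```

Estimates near zero are attributed mostly to the narrow component and get pulled in strongly; large estimates are attributed to the wide component and are left nearly alone. The exact numbers depend on the assumed weight.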