# A debate about robust standard errors: Perspective from an outsider

December 27, 2017

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

A colleague pointed me to a debate among some political science methodologists about robust standard errors, and I told him that the topic didn’t really interest me because I haven’t found a use for robust standard errors in my own work.

My colleague urged me to look at the debate more carefully, though, so I did. But before getting to that, let me explain where I’m coming from. I won’t be trying to make the “Holy Roman Empire” argument that they’re not robust, not standard, and not an estimate of error. I’ll just say why I haven’t found those methods useful myself, and then I’ll get to the debate.

The paradigmatic use case goes like this: You’re running a regression to estimate a causal effect. For simplicity suppose you have good identification and also suppose you have enough balance that you can consider your regression coefficient as some reasonably interpretable sort of average treatment effect. Further assume that your sample is representative enough, or treatment interactions are low enough, that you can consider the treatment effect in the sample as a reasonable approximation to the treatment effect in the population of interest.

But . . . your data are clustered or have widely unequal variances, so the assumption of a model plus independent errors is not appropriate. What you can do is run the regression, get an estimate and standard error, and then use some method of “robust standard errors” to inflate the standard errors so you get confidence intervals with close to nominal coverage.
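To make that use case concrete, here is a minimal sketch on simulated data, not anything from the debate itself. It assumes a single predictor whose error standard deviation grows with |x|, and it computes both the classical standard errors and a hand-rolled heteroskedasticity-robust (HC1 "sandwich") version in NumPy. The function name `ols_with_se` and all the simulation settings are illustrative assumptions, and the sketch covers unequal variances only, not clustering.

```python
import numpy as np

def ols_with_se(X, y):
    """Least-squares fit with classical and HC1 robust standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical: one common error variance for every observation.
    sigma2 = resid @ resid / (n - k)
    se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

    # Sandwich: each observation contributes its own squared residual.
    meat = X.T @ (X * resid[:, None] ** 2)
    cov_robust = XtX_inv @ meat @ XtX_inv * n / (n - k)  # HC1 scaling
    se_robust = np.sqrt(np.diag(cov_robust))
    return beta, se_classical, se_robust

# Simulated data (assumption): error sd proportional to |x|.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n) * np.abs(x)
X = np.column_stack([np.ones(n), x])

beta, se_classical, se_robust = ols_with_se(X, y)
print("slope estimate:    ", beta[1])
print("classical slope se:", se_classical[1])
print("robust slope se:   ", se_robust[1])
```

In this setup the noisiest observations are also the high-leverage ones, so the robust standard error for the slope comes out larger than the classical one. That is the inflation described above.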

That all sounds reasonable. And, indeed, robust standard errors are a popular statistical method. Also, speaking more generally, I’m a big fan of getting accurate uncertainties. See, for example, this paper, where Houshmand Shirani-Mehr, David Rothschild, Sharad Goel, and I argue that reported standard errors in political polls are off by approximately a factor of 2.

But this example also illustrates why I’m not so interested in robust standard errors: I’d rather model the variation of interest (in this case, the differences between polling averages and actual election outcomes) directly, and get my uncertainties from there.

This all came up because a colleague pointed me to an article, “A Note on ‘How Robust Standard Errors Expose Methodological Problems They Do Not Fix, and What to Do About It,'” by Peter Aronow. Apparently there’s some debate about all this going on among political methodologists, but it all seems pointless to me.

Let me clarify: it all seems pointless to me because I’m not planning to use robust standard errors: I’ll model my clustering and unequal variances directly, to the extent these are relevant to the questions I’m studying. That said, I recognize that many researchers, for whatever reason, don’t want to model clustering or unequal variances, and so for them a paper like Aronow’s can be helpful. So I’m not saying this debate among the political methodologists is pointless: it could make a difference when it comes to a lot of actual work that people are doing. Kinda like a debate about how best to format large tables of numbers, or how to best use Excel graphics, or what’s the right software for computing so-called exact p-values (see section 3.3 of this classic paper to hear me scream about that last topic), or when the local golf course is open, or what’s the best car repair shop in the city, or who makes the best coffee, or which cell phone provider has the best coverage: all these questions could make a difference to a lot of people, just not me.

## Unraveling some confusion about the distinction between modeling and inference

One other thing. Aronow writes:

> And thus we conclude . . . in light of Manski (2003)’s Law of Decreasing Credibility: “The credibility of inference decreases with the strength of the assumptions maintained.” Taking Manski’s law at face value, then a semiparametric model is definitionally more credible than any assumed parametric submodel thereof.

What the hell does this mean: “Taking Manski’s law at face value”? Is it a law or is it not a law? How do you measure “credibility” of inference or strength of assumptions?

Or this: “a semiparametric model is definitionally more credible than any assumed parametric submodel thereof”? At first this sounds kinda reasonable, even rigorous with those formal-sounding words “definitionally” and “thereof.” I’m assuming that Aronow is following Manski and referring to the credibility of inferences from these models. But then there’s a big problem, because you can flip it around and get “Inference from a parametric model is less credible than inference from any semiparametric model that includes it.” And that can’t be right. Or, to put it more carefully, it all depends how you fit that semiparametric model.

Just for example, and not even getting into semiparametrics, you can get some really really goofy results if you indiscriminately fit high-degree polynomials when fitting discontinuity regressions. Now, sure, a nonparametric fit should do better—but not any nonparametric fit. You need some structure in your model.
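A small simulation can illustrate the polynomial point. This is a sketch under stated assumptions, not an analysis from any of the papers under discussion: the data are simulated with no true jump at the cutoff x = 0, separate polynomials are fit on each side by unregularized least squares, and the estimated jump is the gap between the two fits at the cutoff. The helper `jump_estimate`, the choice of degrees 1 and 6, and the sample sizes are all hypothetical.

```python
import numpy as np

def jump_estimate(x, y, degree):
    """Estimated discontinuity at x = 0 from separate polynomial fits per side."""
    left = x < 0
    fit_left = np.polyfit(x[left], y[left], degree)
    fit_right = np.polyfit(x[~left], y[~left], degree)
    return np.polyval(fit_right, 0.0) - np.polyval(fit_left, 0.0)

rng = np.random.default_rng(1)
n_sims, n = 500, 200
jumps = {1: [], 6: []}  # polynomial degree -> jump estimates across simulations

for _ in range(n_sims):
    x = rng.uniform(-1, 1, size=n)
    y = 0.5 * x + rng.normal(scale=1.0, size=n)  # smooth truth: the real jump is 0
    for degree in jumps:
        jumps[degree].append(jump_estimate(x, y, degree))

sd_by_degree = {d: float(np.std(v)) for d, v in jumps.items()}
print(sd_by_degree)
```

Both fits contain the true straight line as a special case, yet the degree-6 estimate of the jump is much noisier across simulations, because a high-degree polynomial is poorly pinned down at the edge of its data. The larger model is not automatically the more trustworthy one.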

And this reveals the problem with Aronow’s reasoning in that quote. Earlier in his paper, he defines a “model” as “a set of possible probability distributions, which is assumed to contain the distribution of observable data.” By this he means the distribution of data conditional on (possibly infinite-dimensional) parameters. No prior distribution in that definition. That’s fine: not everyone has to be Bayesian. But if you’re not going to be Bayesian, you can’t speak in general about “inference” from a model without declaring how you’re gonna perform that inference. You can’t fit an infinite-parameter model to finite data using least squares. You need some regularization. That’s fine, but then you get some tangled questions, such as comparing a class of distributions that’s estimated using too-weak regularization to a narrower class that’s estimated more appropriately. It makes no sense in general to say that inference in that first case is “more credible,” or even “definitionally more credible.” It all depends on what you’re doing. Which is why we’re not all running around fitting 11th-degree polynomials to our data, and why we’re not all sitting in awe of whoever fit a model that includes ours as a special case. You don’t have to be George W. Cantor to know that’s a mug’s game. And I’m pretty sure Aronow understands this too; I suspect he just got caught up in this whole debating thing and went a bit too far.
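The need for regularization can be sketched in a few lines. This is an illustration under assumptions rather than a recommended procedure. A cosine basis with 40 terms stands in for a "large" model, there are only 20 data points, and a ridge penalty with an arbitrary strength `lam = 1.0` stands in for "some regularization." All names and settings here are hypothetical.

```python
import numpy as np

def cosine_basis(x, n_terms):
    """Design matrix with columns cos(k * pi * x) for k = 0, ..., n_terms - 1."""
    k = np.arange(n_terms)
    return np.cos(np.pi * np.outer(x, k))

rng = np.random.default_rng(2)
n, n_terms = 20, 40                    # more coefficients than data points
x = (np.arange(n) + 0.5) / n           # evenly spaced grid on (0, 1)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
B = cosine_basis(x, n_terms)

# Unregularized least squares (minimum-norm solution): reproduces the data exactly.
coef_ls, *_ = np.linalg.lstsq(B, y, rcond=None)

# Ridge: same class of curves, but large coefficients are penalized.
lam = 1.0  # arbitrary illustrative penalty strength
coef_ridge = np.linalg.solve(B.T @ B + lam * np.eye(n_terms), B.T @ y)

def train_rmse(coef):
    return float(np.sqrt(np.mean((y - B @ coef) ** 2)))

train_rmse_ls = train_rmse(coef_ls)
train_rmse_ridge = train_rmse(coef_ridge)
print("least squares training RMSE:", train_rmse_ls)
print("ridge training RMSE:        ", train_rmse_ridge)
```

With more coefficients than observations, least squares drives the training error to essentially zero by fitting the noise, and infinitely many other coefficient vectors would fit equally well. Which curve you end up with is decided by the fitting procedure, not by the class of distributions alone.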

