Screening: Everything Old Is New Again
Screening is one of the oldest methods for variable selection. It refers to doing a bunch of marginal (single covariate) regressions instead of one multiple regression. When I was in school, we were taught that it was a bad thing to do.
Now, screening is back in fashion. It’s a whole industry. And before I throw stones, let me admit my own guilt: see Wasserman and Roeder (2009).
1. What Is It?
Suppose that the data are $(X_1,Y_1),\ldots,(X_n,Y_n)$ with $Y_i = \sum_{j=1}^d \beta_j X_i(j) + \epsilon_i$, where $X_i = (X_i(1),\ldots,X_i(d)) \in \mathbb{R}^d$.
To simplify matters, assume that $\mathbb{E}(Y_i)=0$, $\mathbb{E}(X_i(j))=0$ and $\mathrm{Var}(X_i(j))=1$. Let us assume that we are in the high dimensional case where $d > n$. To perform variable selection, we might use something like the lasso.
But if we use screening, we instead do the following. We regress $Y$ on $X(1)$, then we regress $Y$ on $X(2)$, $\ldots$, then we regress $Y$ on $X(d)$. In other words, we do $d$ one-dimensional regressions. Denote the regression coefficients by $\hat\mu_1,\ldots,\hat\mu_d$. We keep the covariates associated with the $k$ largest values of $|\hat\mu_j|$. We then might do a second step such as running the lasso on the covariates that we kept.
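As a minimal sketch of the screening step (the data, dimensions and coefficient values below are all hypothetical, and a real analysis would follow with a second-stage fit such as the lasso on the kept columns):

```python
import numpy as np

def screen(X, Y, k):
    """Marginal screening: fit d one-dimensional regressions of Y on each
    column of X and keep the indices of the k largest |mu_hat_j|.
    A second stage (e.g. the lasso) would then use only these columns."""
    # one-dimensional least-squares coefficient for each covariate
    mu_hat = (X * Y[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
    return np.argsort(np.abs(mu_hat))[-k:]

# toy data: d > n, only the first three coefficients are nonzero
rng = np.random.default_rng(0)
n, d = 100, 200
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:3] = [8.0, -7.0, 6.0]
Y = X @ beta + rng.standard_normal(n)

kept = screen(X, Y, k=10)
print(np.sort(kept))  # with independent covariates, 0, 1, 2 should survive
```

With independent covariates, as here, the marginal coefficients line up with the true $\beta_j$'s, so the screen keeps the right variables; the trouble discussed below arises when the covariates are correlated.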
What are we actually estimating when we regress $Y$ on the $j$th covariate? It is easy to see that $\hat\mu_j$ estimates $\mu_j = \rho_j \sigma$, where $\sigma^2 = \mathrm{Var}(Y_i)$ and $\rho_j$ is the correlation between $Y_i$ and $X_i(j)$.
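Under the standardization above ($\mathrm{Var}(X_i(j)) = 1$), this identity follows in one line:

```latex
\mu_j
= \frac{\mathrm{Cov}(Y_i, X_i(j))}{\mathrm{Var}(X_i(j))}
= \mathrm{Cov}(Y_i, X_i(j))
= \rho_j \, \sigma \, \sqrt{\mathrm{Var}(X_i(j))}
= \rho_j \, \sigma,
\qquad \sigma^2 = \mathrm{Var}(Y_i).
```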
2. Arguments in Favor of Screening
If you miss an important variable during the screening phase you are in trouble. This will happen if $\beta_j$ is big but $\mu_j$ is small. Can this happen?
Sure. You can certainly find values of the $\beta_j$'s and the covariance matrix of the $X_i$'s to make $\beta_j$ big and $\mu_j$ small. In fact, you can make $\beta_j$ huge while making $\mu_j = 0$. This is sometimes called unfaithfulness in the literature on graphical models.
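A numerical illustration (the particular coefficient values are hypothetical): with two correlated covariates, $Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon$ and $\mathrm{Corr}(X_1, X_2) = \rho$, the marginal coefficient of $Y$ on $X_1$ is $\mu_1 = \beta_1 + \rho\beta_2$, so choosing $\rho\beta_2 = -\beta_1$ makes $\mu_1 = 0$ even though $\beta_1$ is huge:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
rho, b1, b2 = 0.5, 10.0, -20.0       # rho * b2 = -b1  =>  mu_1 = 0

# build X1, X2 with Corr(X1, X2) = rho
Z = rng.standard_normal((n, 2))
X1 = Z[:, 0]
X2 = rho * Z[:, 0] + np.sqrt(1 - rho ** 2) * Z[:, 1]
Y = b1 * X1 + b2 * X2 + rng.standard_normal(n)

# one-dimensional regression coefficient of Y on X1
mu1_hat = (X1 @ Y) / (X1 @ X1)
print(round(mu1_hat, 2))             # close to 0: screening would drop X1
```

Even though $\beta_1 = 10$, the marginal coefficient is essentially zero, so $X_1$ would be thrown out at the screening stage.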
However, the set of $\beta$ vectors that are unfaithful has Lebesgue measure 0. Thus, in some sense, unfaithfulness is “unlikely” and so screening is safe.
3. Arguments Against Screening
Not so fast. In order to screw up, it is not necessary to have exact unfaithfulness. All we need is approximate unfaithfulness. And the set of approximately unfaithful $\beta$'s is a non-trivial subset of $\mathbb{R}^d$.
But it’s worse than that. Cautious statisticians want procedures that have properties that hold uniformly over the parameter space. Screening cannot be successful in any uniform sense because of the unfaithful (and nearly unfaithful) distributions.
And if we admit that the linear model is surely wrong, then things get even worse.
Screening is appealing because it is fast, easy and scalable. But it makes a strong (and unverifiable) assumption that you are not unlucky and have not encountered a case where $\mu_j$ is small but $\beta_j$ is big.
Sometimes I find the arguments in favor of screening to be appealing but when I’m in a more skeptical (sane?) frame of mind, I find screening to be quite unreasonable.
What do you think?
Wasserman, L. and Roeder, K. (2009). High dimensional variable selection. Annals of Statistics, 37, 2178–2201.