Screening and False Discovery Rates

December 1, 2012

(This article was originally published at Normal Deviate, and syndicated at StatsBlogs.)

Today we have another guest post. This one is by Ryan Tibshirani, an Assistant Professor in my department. (You might also want to check out the course on Optimization that Ryan teaches jointly with Geoff Gordon.)

Screening and False Discovery Rates
by Ryan Tibshirani

Two years ago, as a TA for Emmanuel Candes’ Theory of Statistics course at Stanford University, I posed a question about screening and false discovery rates on a class homework. Last year, for my undergraduate Data Mining course at CMU, I re-used the same question. The question generated some interesting discussion among the students, so I thought it would be fun to share the idea here. Depending on just how popular Larry’s blog is (or becomes), I may not be able to use it again for this year’s Data Mining course! The question is inspired by conversations at the Hastie-Tibs-Taylor group meetings at Stanford.

Consider a two-class, genetics-inspired problem, with {n} people and {m} gene expression measurements. The people are divided into two groups: {n/2} are healthy, and {n/2} are sick. We have {m_0} null genes (in which there is actually no underlying difference between healthy and sick patients), and {m-m_0} non-nulls (in which there actually is a difference). All measurements are independent.

Suppose that we compute a two sample {t}-statistic {t_j} for each gene {j=1,\ldots m}. We want to call some genes significant by thresholding these {t}-statistics (in absolute value); we then want to estimate the false discovery rate (FDR) of this thresholding rule, which is

\displaystyle  \mathrm{FDR} = \mathbb{E}\left[\frac{\text{number of null genes called significant}}{\text{number of genes called significant}}\right].
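As a reference point, the two sample {t}-statistics above can be computed as follows. This is a minimal sketch, assuming equal-variance (pooled) {t}-statistics; the function and variable names are illustrative, not from the original post.

```python
import numpy as np

def two_sample_t(X, healthy, sick):
    """Pooled two-sample t-statistic for each gene (column of X).

    X: (n, m) array of expression measurements.
    healthy, sick: index arrays for the two patient groups.
    """
    a, b = X[healthy], X[sick]
    na, nb = len(a), len(b)
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    # Pooled variance estimate for each gene
    sp2 = ((na - 1) * a.var(axis=0, ddof=1) +
           (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return mean_diff / np.sqrt(sp2 * (1 / na + 1 / nb))
```

A gene with a real group difference should produce a large {t}-statistic in absolute value, while null genes hover near zero.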

The Benjamini-Hochberg (BH) procedure provides a way to do this, which is best explained using the {p}-values {p_1,\ldots p_m} from {t_1,\ldots t_m}, respectively. We first sort the {p}-values {p_{(1)} \leq \ldots \leq p_{(m)}}; then, given a level {q}, we find the largest {k} such that

\displaystyle  p_{(k)} \leq q \frac{k}{m},

and call the {p}-values {p_{(1)},\ldots p_{(k)}} (and the corresponding genes) significant. It helps to think of this as a rule which rejects all {p}-values {p_1,\ldots p_m} satisfying {p_j \leq c}, for the cutoff {c= p_{(k)}}. The BH estimate for the FDR of this rule is simply {q}.
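The BH rule above can be sketched in a few lines. This is a minimal illustrative implementation, not a production-grade one: it finds the largest {k} with {p_{(k)} \leq qk/m} and rejects the {k} smallest {p}-values.

```python
import numpy as np

def bh_reject(pvals, q):
    """Benjamini-Hochberg: boolean mask of rejected hypotheses at level q."""
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    # Compare sorted p-values to the BH line q*k/m, k = 1, ..., m
    below = sorted_p <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1  # largest k satisfying the inequality
        reject[order[:k]] = True
    return reject
```

Note that all {p}-values up to the {k}th smallest are rejected, even if some of them individually lie above the line {qk/m}.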

An alternative procedure to estimate the FDR uses a null distribution generated by permutations. This means scrambling all of the group labels (healthy/sick) uniformly at random, and then recomputing the {t}-statistics. Having done this {B} times, we let {t^{(i)}_1,\ldots t^{(i)}_m} denote the {t}-statistics computed on the {i}th permuted data set. Now consider the rule that rejects all {t}-statistics {t_1,\ldots t_m} satisfying {|t_j|>c} for some cutoff {c}. The permutation estimate for the FDR of this rule is

\displaystyle  \frac{\frac{1}{B}\sum_{i=1}^B\sum_{j=1}^m 1\{|t^{(i)}_j|>c\}}{\sum_{j=1}^m 1\{|t_j| > c\}} = \frac{\text{average number of null genes called significant over permutations}}{\text{number of genes called significant in original data set}}.
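The permutation estimate translates directly into code. A minimal sketch, assuming the permuted {t}-statistics have already been computed (the argument names are illustrative):

```python
import numpy as np

def perm_fdr(tstats, perm_tstats, c):
    """Permutation estimate of the FDR for the rule |t_j| > c.

    tstats: (m,) statistics on the original data.
    perm_tstats: (B, m) statistics on B label-permuted data sets.
    """
    R = np.sum(np.abs(tstats) > c)  # rejections on the original data
    # Average number of rejections across the B permuted data sets
    V = np.mean(np.sum(np.abs(perm_tstats) > c, axis=1))
    return V / R if R > 0 else 0.0
```

The numerator works because permuting the labels destroys any real group difference, so every rejection on a permuted data set behaves like a null rejection.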

How good are these estimates? To answer this, we’ll look at a simulated example, in which we know the true FDR. Here we have {n=200} patients and {m=2000} genes, {m_0=1900} of which are null. The gene expression measurements are all drawn independently from a standard normal distribution, except for the non-null genes among the sick patients, where the mean was chosen to be -1 or 1 (with equal probability). The plot below shows the estimates as we vary the cutoff {c} (for the BH procedure, this means varying the level {q}) versus the true FDR, averaged over 10 simulated data sets. Both estimates look quite accurate, with the BH estimate being a little conservative.
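The simulation setup can be reproduced roughly as follows. This is a sketch under the stated assumptions; the pooled {t}-statistic, the fixed cutoff {c=2}, and the variable names are all illustrative choices, not taken from the original post.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, m0 = 200, 2000, 1900

# Standard normal everywhere; non-null genes get mean -1 or +1 for sick patients
X = rng.normal(size=(n, m))
signs = rng.choice([-1.0, 1.0], size=m - m0)
X[n // 2:, m0:] += signs

healthy, sick = X[: n // 2], X[n // 2:]
# Pooled two-sample t-statistics (equal group sizes)
diff = healthy.mean(axis=0) - sick.mean(axis=0)
sp2 = (healthy.var(axis=0, ddof=1) + sick.var(axis=0, ddof=1)) / 2
t = diff / np.sqrt(sp2 * (2 / (n // 2)))

# Since we simulated the data, we know which genes are null, so we can
# compute the true false discovery proportion at a given cutoff
c = 2.0
called = np.abs(t) > c
is_null = np.arange(m) < m0
fdp = (called & is_null).sum() / max(called.sum(), 1)
```

Repeating this over a grid of cutoffs and many simulated data sets, and averaging the false discovery proportions, gives the true-FDR curve against which the BH and permutation estimates are compared.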


Now what happens if, before computing these estimates, we restrict our attention to a small group of genes that looked promising in the first place? Specifically, suppose that we screened for genes based on high between-group variance (between the healthy and sick groups). The idea is to only consider the genes for which there appears to be a difference between the healthy and sick groups. Turning to our simulated example, we kept only the 1000 of the 2000 genes with the highest between-group variance, and then computed the BH and permutation estimates as usual (as if we were given this screened set to begin with). The plot below shows that the FDR estimates are now quite bad, as they’re way too optimistic.


Here is the interesting part: if we screen by total variance (the variance of all gene expression measurements, pooling the healthy and sick groups), then this problem goes away. The logic behind screening by total variance is that, if there’s not much variability overall, then there’s probably no interesting difference between the healthy and sick groups. In our simulated example, we kept only the 1000 of the 2000 genes with the highest total variance, and computed the BH and permutation estimates as usual. We can see below that both estimates of the FDR are pretty much as accurate as they were in the first place (with no screening performed), if anything a little more conservative.
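The two screening rules can be sketched side by side. This is an illustrative implementation under the assumption of equal group sizes; the function names are my own, not from the original post.

```python
import numpy as np

def between_group_var(X, g1, g2):
    """Between-group variance per gene: spread of the two group means."""
    m1, m2 = X[g1].mean(axis=0), X[g2].mean(axis=0)
    grand = (m1 + m2) / 2
    return (m1 - grand) ** 2 + (m2 - grand) ** 2

def total_var(X):
    """Total variance per gene, pooling all patients."""
    return X.var(axis=0, ddof=1)

def screen_top(scores, k):
    """Indices of the k genes with the highest score."""
    return np.argsort(scores)[::-1][:k]
```

The two rules can rank genes very differently: a gene with a large group difference tops the between-group ranking, while a gene that is merely noisy overall can top the total-variance ranking.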


Why do you think that the estimates after screening by between-group variance and by total variance exhibit such different behaviors? That is, why is it OK to screen by total variance but not by between-group variance? I’ll share my own thoughts in a future post.
