High Dimensional Undirected Graphical Models

September 17, 2012
By

(This article was originally published at Normal Deviate, and syndicated at StatsBlogs.)

High Dimensional Undirected Graphical Models
Larry Wasserman

Graphical models have now become a common tool in applied statistics. Here is a graph representing stock data:

Here is one for proteins: (Maslov and Sneppen, 2002).

And here is one for the voting records of senators: (Banerjee, El Ghaoui, and d’Aspremont ,2008).

Like all statistical methods, estimating graphs is challenging in high dimensional problems.

1. Undirected Graphical Models

Let {X=(X_1,\ldots,X_d)} be a random vector with distribution {P}. A graph {G} for {P} has {d} nodes, or vertices, one for each variable. Some pairs of vertices are connected by edges. We can represent the edges as a set {E} of unordered pairs: {(j,k)\in E} if and only if there is an edge between {X_j} and {X_k}.

We omit an edge between {X_i} and {X_j} to mean that {X_i} and {X_j} are independent, given the other variables which we write as

\displaystyle  X_i \amalg X_j | rest

where “rest” denotes the rest of the variables. For example, in this graph

\displaystyle  X_1 \ \ \rule{2in}{1mm} \ \ X_2 \ \ \rule{2in}{1mm}\ \ X_3

we see that {X_1\amalg X_3 | X_2} since there is not edge between {X_1} and {X_3}.

Now suppose we observe {n} random vectors

\displaystyle  X^{(1)},\ldots, X^{(n)} \sim P.

The goal is to estimate {G} from the data. As an example, imagine that {X^{(i)}} is a vector of gene expression levels for subject {i}. The graph gives us an idea of how the genes are related.

2. The Gaussian Case

Things are easiest if we assume that {P} is a multivariate Gaussian. Suppose {X\sim N(\mu,\Sigma)}. In this case, there is no edge between {X_j} and {X_k} iff {\Omega_{jk}=0} where {\Omega = \Sigma^{-1}}. If the dimension {d} of {X} is smaller than the sample size {n}, then estimating the graph is straightforward. We can use the sample covariance matrix

\displaystyle  S = \frac{1}{n}\sum_{i=1}^n (X^{(i)}-\overline{X})(X^{(i)}-\overline{X})^T

to estimate {\Sigma} and we use {S^{-1}} to estimate {\Omega}. It is then easy to test {H_0: \Omega_{jk}=0}. We put an edge between {j} and {k} if the test rejects. The R package SIN will do this for you.

When {d>n}, the simple approach above won’t work. Perhaps the easiest approach is the method due to Meinshausen and B{ü}hlmann (2006) which John Lafferty has aptly named parallel regression. The idea is this: you regress {X_1} on all the other variables, then you regress {X_2} on all the other variables, etc. For each regression, you use the lasso which yields sparse estimators. When you regress {X_j} on the others, there will be a regression coefficient for {X_k}. Call this {\hat\beta_{jk}}. Similarly, when you regress {X_k} on the others, there will be a regression coefficient for {X_j}. Call this {\hat\beta_{kj}}. If {\hat\beta_{jk}\neq 0} and {\hat\beta_{kj}\neq 0} then you put an edge between {X_j} and {X_k}.

An alternative — the glasso (graphical lasso) — is to maximize the Gaussian log-likelihood {\ell(\mu,\Omega)} with a penalty:

\displaystyle  \ell(\mu,\Omega) + \lambda \sum_{j\neq k} |\Omega_{jk}|.

The resulting estimator {\hat\Omega} is sparse: many of its elements are 0. The non-zeroes denote edges. A fast implementation in R due to Han Liu can be found here.

3. Relaxing Gaussianity

The biggest drawback of the glasso is the assumption of Gaussianity. With my colleagues John Lafferty and Han Liu and others, I have done some work on more nonparametric approaches.

Let me briefly describe the approach in Liu, Xu, Gu, Gupta, Lafferty and Wasserman (2011). The idea is to restrict the graph to be a forest, which is a graph with no cycles.

When {G} is a forest, the density {p} can be written as

\displaystyle  p(y) = \prod_{(j,k)\in E} \frac{p(y_j,y_k)}{p(y_j)p(y_k)}\prod_{s=1}^d p(y_s)

where {E} is the set of edges. We can estimate the density by inserting estimates of the univariate and bivariate marginals:

\displaystyle  \hat{p}(y) = \prod_{(j,k)\in E} \frac{\hat p(y_j,y_k)}{\hat p(y_j)\hat p(y_k)}\prod_{s=1}^d \hat p(y_s).

But to find the graph we need to estimate the edge set {E}.

Any forest {F} defines a density {p_F(y)} by the above equation. If the true density {p} were known, we could choose {F} to minimize the Kullback-Leibler distance

\displaystyle  D(p,p_F) = \int p(y) \log\left(\frac{p(y)}{p_F(y)} \right) dy.

This maximization can be done by Kruskal’s algorithm (Kruskal 1956) also known, in this context, as the Chow-Liu algorithm (Chow and Liu 1968). It works like this.

For each pair {(X_j,X_k)} let

\displaystyle  I(Y_j,Y_k) = \int p(y_j,y_k) \log \frac{p(y_j,y_k)}{p(y_j)p(y_k)} dy_j\, dy_k

be the mutual information between {Y_j} and {Y_k}. We start with an empty tree and add edges greedily according to the value of {I(Y_j,Y_k)}. First we connect the pair of variables with the largest {I(Y_j,Y_k)} and then the second largest and so on. However, we do not add an edge if it forms a cycle.

Of course, we don’t know {p} so, instead, we use the data to estimate the mutual information {I(Y_j,Y_k)}. If we simply run the Chow-Liu algorithm, we get a tree, that is, a fully connected forest. But we also use a hold-out sample to decide on when to stop adding edges. And that’s it.

There are other ways to relax the Gaussian assumption. For example, here is an approach that uses rank statistics and copulas.

4. The Future?

Not too long ago, estimating a high dimensional graph was unthinkable. Now they are used routinely. The biggest thing missing is a good measure of uncertainty. One can do some obvious things, such as bootstrapping, but it is not clear that the output is meaningful.

5. References

Banerjee, O. and El Ghaoui, L. and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9, 485-516.

Chow, C. and Liu, C. (1968). IEEE Transactions on Information Theory, 14, 462-467.

Kruskal, J.B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical society, 7, 48-50.

Liu, H., Xu, M., Gu, H., Gupta, A., Lafferty, J. and Wasserman, L. (2011). Forest Density Estimation. Journal of Machine Learning Research, 12, 907-951.

Maslov, S. and Sneppen, K. (2002). Specificity and stability in topology of protein networks. Science, 296, 910.

Meinshausen, N. and B{ü}hlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436-1462.




Please comment on the article here: Normal Deviate

Tags:

Subscribe

Email:

  Subscribe