Too many MC’s not enough MIC’s, or What principles should govern attempts to summarize bivariate associations in large multivariate datasets?

February 4, 2013

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Justin Kinney writes:

Since your blog has discussed the “maximal information coefficient” (MIC) of Reshef et al., I figured you might want to see the critique that Gurinder Atwal and I have posted.

In short, Reshef et al.’s central claim that MIC is “equitable” is incorrect.

We [Kinney and Atwal] offer mathematical proof that the definition of “equitability” Reshef et al. propose is unsatisfiable—no nontrivial dependence measure, including MIC, has this property. Replicating the simulations in their paper with modestly larger data sets validates this finding.

The heuristic notion of equitability, however, can be formalized instead as a self-consistency condition closely related to the Data Processing Inequality. Mutual information satisfies this new definition of equitability but MIC does not. We therefore propose that simply estimating mutual information will, in many cases, provide the sort of dependence measure Reshef et al. seek.

For background, here are my two posts (Dec 2011 and Mar 2012) on this method for detecting novel associations in large data sets. I never read the paper in detail but on a quick skim it looked really cool to me. As I saw it, the clever idea of the paper is that, instead of going for an absolute measure (which, as we’ve seen, will be scale-dependent), they focus on the problem of summarizing the grid of pairwise dependences in a large set of variables. Thus, Reshef et al. provide a relative rather than absolute measure of association, suitable for comparing pairs of variables within a single dataset even if the interpretation is not so clear between datasets.

At the time, I was left with two questions:

1. What is the value of their association measure if applied to data that are on a circle? For example, suppose you generate these 1000 points in R:

n <- 1000
theta <- runif (n, 0, 2*pi)
x <- cos (theta)
y <- sin (theta)

Simulated in this way, x and y have an R-squared of essentially 0. And, indeed, knowing x tells you little (on average) about y (and vice-versa). But, from the description of the method in the paper, it seems that their R-squared-like measure might be very close to 1. I can’t really tell. This is interesting to me because it’s not immediately clear what the right answer “should” be. If you can capture a bivariate distribution by a simple curve, that’s great; on the other hand if you can’t predict x from y or y from x, then I don’t know that I’d want an R-squared-like summary to be close to 1.

No measure can be all things to all datasets, so let me emphasize that the above is not a criticism of the idea of Reshef et al. but rather an exploration.
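
One rough way to explore question 1 is to compare the squared correlation with a plug-in estimate of mutual information on the simulated circle. The sketch below is not MIC: the helper binned_mi, the seed, and the choice of 10 equal-width bins are all assumptions made for illustration, and the estimate is sensitive to the number of bins and to sample size.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
n <- 1000
theta <- runif(n, 0, 2*pi)
x <- cos(theta)
y <- sin(theta)

# Squared Pearson correlation: near 0 for points on a circle
r_squared <- cor(x, y)^2

# Hypothetical helper: plug-in mutual information (in bits) from a
# two-way histogram with equal-width bins. A crude estimator, not MIC.
binned_mi <- function(x, y, bins = 10) {
  joint <- unclass(table(cut(x, bins), cut(y, bins))) / length(x)
  px <- rowSums(joint)          # marginal bin proportions for x
  py <- colSums(joint)          # marginal bin proportions for y
  expected <- outer(px, py)     # cell proportions under independence
  nonzero <- joint > 0          # skip empty cells to avoid log(0)
  sum(joint[nonzero] * log2(joint[nonzero] / expected[nonzero]))
}

mi_circle <- binned_mi(x, y)
mi_shuffled <- binned_mi(x, sample(y))  # permuting y breaks the dependence
```

Comparing mi_circle with mi_shuffled gives a sense of how far the circle sits from independence under this particular binning, even though r_squared is close to 0. The shuffled value is not exactly zero because the plug-in estimator is biased upward in finite samples.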

2. I wonder if they’d do even better by log-transforming any variables that are all-positive. (I thought about this after looking at the graphs in Figure 4.) A more general approach would be for their grid boxes to be adaptive.
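
The log-transform idea in question 2 could be sketched as a small preprocessing step applied before any pairwise association measure is computed. The helper log_if_positive and the toy data frame are hypothetical and are not part of the Reshef et al. method.

```r
# Hypothetical helper: log-transform every numeric column whose observed
# values are all strictly positive; leave every other column unchanged.
log_if_positive <- function(df) {
  for (name in names(df)) {
    col <- df[[name]]
    if (is.numeric(col) && all(col > 0, na.rm = TRUE)) {
      df[[name]] <- log(col)
    }
  }
  df
}

# Toy data, made up for illustration
toy <- data.frame(income = c(10, 100, 1000), change = c(-1, 0, 2))
transformed <- log_if_positive(toy)  # income is logged, change is untouched
```

A fixed equal-width grid applied after this step behaves more like an adaptive grid on the original scale, since the bins are narrower where small positive values cluster.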

My second post reported some criticisms of the method. Reshef et al. responded in a comment.

In any case, all these methods (including the method discussed in the paper by Simon and Tibshirani) seem like a step forward from what we typically use in statistics. So this all seems like a great discussion to be having. I like how Kinney and Atwal are going back to first principles.

P.S. There was one little thing I’m skeptical of, not at all central to Kinney and Atwal’s main points. Near the bottom of page 11 they suggest that inference about joint distributions (in their case, with the goal of estimating mutual information) is not a real concern now that we are in such a large-data world. But, as we get more data, we also gain the ability and inclination to subdivide our data into smaller pieces. For example, sure, “consumer research companies routinely analyze data sets containing information on ∼ 10^5 shoppers,” but it would be helpful to break up the data and learn about different people, times, and locations, rather than computing aggregate measures of association. So I think “little-data” issues such as statistical significance and efficiency are not going away. Again, this is only a small aside in their paper but I wanted to mention the point.
