Cool new method for detecting associations between variables in large datasets (UPDATEDx2)
Submitted by drupaladmin on 16 December 2011.
Writing in Science this week, Reshef et al. present what looks like a very clever, powerful, and general method for detecting associations between variables in large (many-variable) datasets. Statistician Andrew Gelman has a good write-up over at his blog, and the "Perspectives" piece in Science is also good.
Basically, the authors propose a new measure of association between two variables, ranging from 0 (no association) to 1 (perfectly associated). If the two variables are linearly related to one another, the new measure is roughly equivalent to the familiar R². But the cool thing is that it works for pretty much any form of association: apparently not just nonlinear and non-monotonic relationships between variables, but even associations that can't be described by a single mathematical function!
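To see why a more general measure is appealing, here's a minimal Python sketch (mine, not from the paper) of how plain old R² behaves on a perfectly deterministic but non-monotonic relationship. It does not implement the new measure at all; the function name `r_squared` and the toy data are just assumptions for illustration.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Toy data (made up): x values symmetric around zero
xs = [i / 10 for i in range(-50, 51)]
linear = [2 * x + 1 for x in xs]   # perfectly linear in x
parabola = [x * x for x in xs]     # perfectly determined by x, but non-monotonic

r2_linear = r_squared(xs, linear)      # essentially 1
r2_parabola = r_squared(xs, parabola)  # essentially 0, despite perfect dependence

print(r2_linear, r2_parabola)
```

The parabola is completely determined by x, yet R² says there's nothing there. That's the kind of relationship a general measure of association is supposed to pick up.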
As many-variable datasets become increasingly common in ecology, it's going to become increasingly important to have good tools for data exploration, since rarely will we have strong a priori hypotheses about which variables should be associated with which others, and in what way. This new approach looks like it could be just the ticket.
Some questions and food for thought (none of which are criticisms, and some of which aren't original to me; see the links above):
- How data hungry is this approach? How does it perform with relatively small numbers of observations?
- As Andrew Gelman points out, it seems that it only provides a relative measure of association, which I don't think can be sensibly compared across datasets.
- This approach doesn't free you from the problem of multiple comparisons. You're still going to have to do a Bonferroni correction, control your false discovery rate, or something similar in order to separate real associations between variables from ones that just reflect random chance.
- Correlation is still not causality, no matter how much data you have, how clever your measure of correlation, and how well you correct for multiple comparisons. Especially when it's not even a measure of partial correlation. An important direction for future work will be to see if it's possible to extend this new measure of association into something analogous to partial correlation, so that you can ask about the association between two variables independent of the other variables in the dataset. That doesn't prove causality either, of course, but it can be a better hint than a plain ol' measure of association. Whether or not such an extension proves possible, I think for most applications it will be important to follow up this approach with independent checks meant to reveal the causal underpinnings of putative associations.
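On the multiple comparisons point, here's a rough Python sketch of the two corrections mentioned above: Bonferroni, and the Benjamini-Hochberg procedure for controlling the false discovery rate. The p-values are invented purely for illustration (they don't come from the paper or any real dataset), and how you'd get p-values for the new measure in the first place is a separate question this sketch doesn't address. The function names are my own.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject a test only if its p-value is at most alpha / (number of tests)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for false discovery rate control."""
    m = len(pvals)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k whose p-value is at most alpha * k / m
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            cutoff_rank = rank
    # Reject every test ranked at or below that cutoff
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

# Hypothetical p-values for ten pairwise associations (made up for this example)
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

n_bonferroni = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(n_bonferroni, n_bh)
```

With these made-up numbers, Bonferroni keeps one association and Benjamini-Hochberg keeps two, even though five of the ten raw p-values are below 0.05. With thousands of variable pairs, the gap between "nominally significant" and "survives correction" gets much bigger.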
I propose a race: let's see who can be the first to write an R package implementing this new approach. Ethan, Jarrett, Scott, Carl, Ted: on your marks, get set...go!
UPDATE: Whoops, turns out there's already an R wrapper on the authors' site (HT Scott Chamberlin). Too bad, I was kind of looking forward to posting odds and taking bets on that race...
UPDATE #2: Hmm, looks like there's a drawback to this approach: the extreme generality is (probably not surprisingly) purchased at the cost of power. See the comment by Simon and Tibshirani here. So if you do have some reasonable a priori idea of what sort of relationships you're looking for, or what variables you expect to be related, this may not be the best approach for you.