(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)
Ilya Esteban writes:
In your blog, your advice for performing regression in the presence of large numbers of correlated features has been to use composite scores and hierarchical modeling. Unfortunately, many problems don’t provide an obvious and unambiguous way of grouping features together (e.g. gene expression data). Are there any techniques that you would recommend that automatically pool correlated features together based on the data, without requiring the researcher to manually define composite scores or feature hierarchies?
I don’t know the answer to this but I imagine something is possible . . . any ideas?
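To make the question concrete, here is a minimal sketch of one naive way to pool features from the data alone: link any two features whose absolute correlation exceeds a cutoff, treat each connected group as a cluster, and average the standardized features in each cluster into a composite score. This is only an illustration and not a recommended answer or anything from the paper discussed below. The function name `pool_correlated_features`, the 0.8 threshold, and the simulated data are all assumptions made up for the example, and the sketch assumes no column is constant.

```python
import math
import random


def zscore(col):
    """Standardize a column to mean 0 and (population) standard deviation 1."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd for v in col]


def pool_correlated_features(columns, threshold=0.8):
    """Group features with |correlation| >= threshold and average each group.

    `columns` is a list of equal-length lists, one per feature (hypothetical
    layout chosen for this sketch). Returns (groups, composites): `groups` is
    a list of lists of column indices, and `composites` holds one averaged
    z-score column per group.
    """
    z = [zscore(c) for c in columns]
    n, p = len(z[0]), len(z)

    def corr(i, j):
        # Pearson correlation of two already-standardized columns.
        return sum(a * b for a, b in zip(z[i], z[j])) / n

    # Union-find: features linked by a strong correlation share a root.
    parent = list(range(p))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(p):
        for j in range(i + 1, p):
            if abs(corr(i, j)) >= threshold:
                parent[find(j)] = find(i)

    members = {}
    for i in range(p):
        members.setdefault(find(i), []).append(i)
    groups = sorted(members.values())

    composites = []
    for g in groups:
        # Flip features that are negatively correlated with the group's
        # first member so that they do not cancel out in the average.
        signs = [1.0 if corr(g[0], j) >= 0 else -1.0 for j in g]
        composites.append(
            [sum(s * z[j][r] for s, j in zip(signs, g)) / len(g) for r in range(n)]
        )
    return groups, composites


# Simulated data: three noisy copies of one signal (one sign-flipped)
# and two noisy copies of an independent signal.
random.seed(0)
base_a = [random.gauss(0, 1) for _ in range(200)]
base_b = [random.gauss(0, 1) for _ in range(200)]
columns = [
    [a + random.gauss(0, 0.1) for a in base_a],
    [a + random.gauss(0, 0.1) for a in base_a],
    [-a + random.gauss(0, 0.1) for a in base_a],
    [b + random.gauss(0, 0.1) for b in base_b],
    [b + random.gauss(0, 0.1) for b in base_b],
]
groups, composites = pool_correlated_features(columns, threshold=0.8)
print(groups)
```

The composites could then be used as regression predictors in place of the raw features. The hard threshold is the weak point of this approach: it forces an all-or-nothing grouping, which is part of why the question above remains open.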
In the meantime I’m reminded of this recent article by Shaw-Hwa Lo, Haitian Wang, Tian Zheng, and Inchi Hu:
Recent high-throughput biological studies successfully identified thousands of risk factors associated with common human diseases. Most of these studies used single-variable method and each variable is analyzed individually. The risk factors so identified account for a small portion of disease heritability. Nowadays, there is a growing body of evidence suggesting gene–gene interactions as a possible reason for the missing heritability . . .
To address these challenges, the proposed method extracts different types of information from the data in several stages. In the first stage, we select variables with high potential to form influential variable modules when combining with other variables. In the second stage, we generate highly influential variable modules from variables selected in the first stage so that each variable interacts with others in the same module to produce a strong effect on the response Y. The third stage combines classifiers, each constructed from one module, to form the classification rule. . . .
I haven’t tried to follow all the details but it looks cool. These genetics problems are different from the social science and environmental health examples I work on. In genetics there seem to be many true zeros; that is, you really are trying to find a bunch of needles in a haystack. In my problems, nothing is really zero and we only set things to zero for computational convenience or to make our models more understandable. Hence the appeal of methods such as BART and Gaussian processes, which do not force effects to be exactly zero. Shaw-Hwa’s paper is interesting in that it directly grapples with the problem of interactions.