Larry's latest post is a call for the biggest unsolved problems in statistics and machine learning. As he says, the current Wikipedia page on unsolved problems in statistics is not what he's looking for, as all the examples are either "boring", "interesting but vague", or silly (or, as he puts it, "you've got to be kidding"). As he says, a good big problem is one which can be stated succinctly, can be explained to a non-specialist, and "when you explain it to someone they think it is cool, even if they don't know what you are talking about".
I am not sure that any of these really qualify, because they're all less sexy than the "P=NP", or Navier-Stokes problems Larry has in mind from other disciplines. But they're ones which bug me, and seem like they might have more leverage than (save me) checking admissibility of some point estimate or another. I have some ideas about some of them, but hopefully my collaborators will not regard me as setting us up to be scooped by mentioning the problems. And if you know of solutions to any of these, please do tell.
- Devise an algorithm which tells us exactly how much of a statistical model can be linked to data, and how much must be taken on faith. More specifically, given a parametric model, either return that it is fully identified, or, if it's not, construct the maximal identifiable parameter, and figure out all of the parameters which are partially identified.
- Apply said algorithm to the problem of high-dimensional regression, i.e., where we have more predictor variables than data-points. How much of the regression function is identified?
- When and how does sub-sampling or re-sampling give valid generalization error bounds?
- When we don't have that much data but do have a bunch of similar cases, we often want to do partial pooling. The traditional way to do partial pooling is through hierarchical Bayesian models. Do we lose anything by instead using shrinkage penalties, with the size of the penalties picked by cross-validation? If so, what do we lose?
- How do we put valid confidence intervals on coefficients of linear models which were estimated with penalties, like ridge regression or the lasso?
- Find sufficient (and necessary?) conditions for the lasso to be model-selection-consistent which users can (a) understand and (b) check.
- In applications, everyone does multi-fold cross-validation. When and why, exactly, does it work? When will it definitely not work? How different are the answers if "work" means "predict well out of sample" or "get the model structure right"?
- Rademacher complexity puts powerful bounds on generalization error for bounded loss functions. What is the equivalent for unbounded loss functions (like squared error)?
- Three technologies which have become quite central to actually doing statistics since the 1980s are smoothing (for regression and density estimation), the bootstrap (for assessing uncertainty), and cross-validation for picking control settings. We know how to do these for independent data, and for time series, and more or less for spatial data. How do we make any of them work for networks, or for relational data more generally?
- For that matter, how do we draw big networks in a way which doesn't lead to what M. E. J. Newman calls a "ridiculogram"?
- Why does boosting work? (I incline to Schapire and Freund's margin-based explanation, but it's certainly not settled.)
- Most current ensemble methods are simply averaging/voting, perhaps with weights. What advantages, if any, are there to more elaborate aggregation mechanisms?
- If a mixture-of-experts algorithm achieves low regret, and there is a uniform generalization error bound for the experts' risks, when can we bound the ensemble's risk? (Cf. a, b.) If we can give such a bound, is there any reason to not use low-regret algorithms?
- What is the weakest dependence condition on a time series which lets us give distribution-free bounds like the IID ones? The weakest dependence condition for a random field? Can these dependence conditions be tested?
- Develop mis-specification tests where failing the test indicates how the model should be changed. (Both Neyman's smooth test and the PC algorithm are examples of what I have in mind, so this might hit the "interesting but vague" flaw.)
- What are the weakest conditions under which causal discovery is uniformly consistent? The weakest ones which could be invoked in good conscience in applications?
- Suppose that the real data-generating process is an absurdly-high-dimensional linear system, but all the variables we measure are random, comparatively-low-dimensional projections of the real variables (not necessarily onto linear surfaces). What would typical regression relations among such projections look like? How different are they from the regression functions we actually see? Should we expect sparsity under such conditions, or rather anti-sparsity?
- Give a rule for picking the number of clusters which (a) converges to the true number of clusters when there is one, and (b) predicts well even when there really isn't a finite number of clusters. (Presumably any good answer here would also work for community discovery in networks.)
- Give a similar rule for picking the dimension of a latent variable, without (necessarily) assuming linear structure.
- Many model families contain a discrete, "structural" parameter --- graphs in graphical models, trees in phylogenies or hierarchical clusterings, etc. How do we give a confidence set for such parameters? We could just test each possibility, and report the set which we don't reject, but this is too expensive because of combinatorial explosions, and would often look ugly; could we do some sort of branch-and-bound? If we try to pick the discrete structure through model selection, how do we handle not having a single chain of nested models, but rather a lattice? How do we detect significant differences in structure, or do partial pooling for structure?
Something which is definitely too vague would be "when, if ever, is parsimony a good guide to choosing between models?". So that's what I'll be talking about for the next few days.
Update, next day: Fixed some typos, added some missing links, added one problem at the end.
Update, 1 July: Simply Statistics was nice enough to link to this, observing that most (I'd say all) of these problems are "category 1" in their three-part scheme, motivated by internal considerations from statistics. I don't think any are motivated by convenient data sets (their category 2), or pressing scientific questions (category 3). For the record I think that's unfortunate; it's problems in the last category, where there's an actual issue about the world we need to resolve, which matter most. (Their "category 2" seems like exercises — perhaps very difficult exercises.) But for better or for worse, it's a lot easier for me to think of problems in the first category. What gives those problems their interest, though, is the hope that they could be used to answer lots of third-category questions...
Please comment on the article here: Three-Toed Sloth