All of Machine Learning in One Expression

January 9, 2017

(This article was originally published at No Hesitations, and syndicated at StatsBlogs.)

Sendhil Mullainathan gave an entertaining plenary talk on machine learning (ML) in finance last Saturday in Chicago, at the annual American Finance Association (AFA) meeting. (Many hundreds of people, standing room only -- great to see.) There was not much new relative to the posts here, for example, but he wasn't trying to deliver new results. Rather, he was trying to introduce mainstream AFA financial economists to the ML perspective.

[Of course the ML perspective and methods have featured prominently in time-series econometrics for many decades, but many of the recent econometric converts to ML (and audience members at the AFA talk) are cross-section types, not used to thinking much about things like out-of-sample predictive accuracy.]

Anyway, one cute and memorable thing -- good for teaching -- was Sendhil's suggestion that one can use the canonical penalized estimation problem as a taxonomy for much of ML.  Here's my quick attempt at fleshing out that suggestion.

Consider estimating a parameter vector \( \theta \) by solving the penalized estimation problem,

\( \hat{\theta} = \arg\min_{\theta} \sum_{i} L (y_i - f(x_i, \theta) ) ~~\text{s.t.}~~ \gamma(\theta) \le c , \)

or equivalently in Lagrange multiplier form,

\( \hat{\theta} = \arg\min_{\theta} \sum_{i} L (y_i - f(x_i, \theta) ) + \lambda \gamma(\theta) . \)

(1) \( f(x_i, \theta) \) is about the modeling strategy (linear, parametric non-linear, non-parametric non-linear (series, trees, nearest-neighbor, kernel, ...)).

(2) \( \gamma(\theta) \) is about the type of regularization. (Concave penalty functions non-differentiable at the origin produce selection to zero; smooth convex penalties produce shrinkage toward zero; the LASSO penalty is non-differentiable at the origin and linear on each side of it, hence both concave and convex there, so it both selects and shrinks; ...)

(3) \( \lambda \) is about the strength of regularization.

(4) \( L(y_i - f(x_i, \theta) ) \) is about predictive loss (quadratic, absolute, asymmetric, ...).
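To make the four ingredients concrete, here is a minimal Python sketch of the Lagrange-form objective with each piece passed in as a plain function. Everything in it is an illustrative assumption rather than anything from the talk: the names (`penalized_objective`, `grid_argmin`, `estimate`), the three-point toy data set, the one-parameter model through the origin, and the brute-force grid search used in place of a real optimizer.

```python
def penalized_objective(theta, xs, ys, f, loss, penalty, lam):
    """Return sum_i L(y_i - f(x_i, theta)) + lam * gamma(theta)."""
    fit = sum(loss(y - f(x, theta)) for x, y in zip(xs, ys))
    return fit + lam * penalty(theta)

# (1) modeling strategy: one-parameter linear model through the origin
def linear(x, theta):
    return theta * x

# (2) two candidate penalties for a scalar theta
def l1(theta):
    return abs(theta)       # LASSO-type, kinked at zero

def l2(theta):
    return theta ** 2       # ridge-type, smooth and convex

# (4) predictive loss
def quadratic(e):
    return e ** 2

def grid_argmin(objective, lo=-3.0, hi=3.0, steps=6001):
    """Brute-force minimizer over an evenly spaced grid that contains 0."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(grid, key=objective)

# Toy data (hypothetical), chosen so the unpenalized estimate is small.
xs = [1.0, 2.0, 3.0]
ys = [0.2, 0.1, 0.4]

def estimate(penalty, lam):
    return grid_argmin(
        lambda t: penalized_objective(t, xs, ys, linear, quadratic, penalty, lam)
    )

lam = 4.0                          # (3) strength of regularization
theta_ols = estimate(l2, 0.0)      # lam = 0: no regularization
theta_ridge = estimate(l2, lam)    # smooth penalty: shrinks toward zero
theta_lasso = estimate(l1, lam)    # kinked penalty: can select to exactly zero

print(theta_ols, theta_ridge, theta_lasso)
```

With these particular numbers the smooth penalty pulls the estimate toward zero without reaching it, while the kinked penalty sets it to exactly zero. Swapping `quadratic` for an absolute or asymmetric loss, or `linear` for some other \( f \), changes the scheme without touching `penalized_objective`.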

Many ML schemes emerge as special cases. To take just one well-known example, linear regression with regularization by LASSO and regularization strength chosen to optimize out-of-sample predictive MSE corresponds to (1) \( f(x_i, \theta)\) linear, (2) \( \gamma(\theta) = \sum_j |\theta_j| \), (3) \( \lambda \) cross-validated, and (4) \( L(y_i - f(x_i, \theta) ) = (y_i - f(x_i, \theta) )^2 \).
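A rough sketch of that special case, assuming NumPy is available. The coordinate-descent solver `lasso_cd`, the helper `cv_mse`, the simulated data, and the grid of \( \lambda \) values are all hypothetical choices made for illustration. To keep it short there is no intercept and the cross-validation folds are not shuffled.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink a toward zero by t, returning exactly zero when |a| <= t."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for sum_i (y_i - x_i'theta)^2 + lam * sum_j |theta_j|."""
    n, p = X.shape
    theta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)          # x_j'x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # residual with coordinate j's contribution added back in
            partial_resid = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ partial_resid
            theta[j] = soft_threshold(rho, lam / 2.0) / col_ss[j]
    return theta

def cv_mse(X, y, lam, k=5):
    """Average out-of-sample MSE over k contiguous folds."""
    n = len(y)
    errs = []
    for test_idx in np.array_split(np.arange(n), k):
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        theta = lasso_cd(X[train_idx], y[train_idx], lam)
        errs.append(np.mean((y[test_idx] - X[test_idx] @ theta) ** 2))
    return float(np.mean(errs))

# Simulated data: only the first two of eight coefficients are nonzero.
rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.standard_normal((n, p))
true_theta = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_theta + rng.standard_normal(n)

# (3) lambda cross-validated over a small illustrative grid
lambdas = [0.0, 1.0, 5.0, 20.0, 50.0, 200.0, 1000.0]
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
theta_hat = lasso_cd(X, y, best_lam)

print(best_lam, np.round(theta_hat, 3))
```

Note that the penalty here is scaled exactly as in the expression above (sum of squared errors plus \( \lambda \sum_j |\theta_j| \)). Packaged LASSO routines often divide the squared-error term by the sample size or by twice the sample size, so their \( \lambda \) values are not directly comparable.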
