Ridge coefficients for multiple values of the regularization parameter can be elegantly computed by updating the thin SVD of the design matrix: import numpy as np from scipy import linalg def ridge(A, b, alphas): """ ...
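
Since the excerpt above is cut off, here is a minimal sketch of the idea rather than the original post's code. It assumes the usual ridge objective ||Ax - b||^2 + alpha * ||x||^2 with every alpha strictly positive. The name `ridge_path` and its return layout (one row of coefficients per alpha) are hypothetical choices made for illustration. The thin SVD is computed once, and each alpha only changes the shrinkage factors s / (s^2 + alpha).

```python
import numpy as np
from scipy import linalg


def ridge_path(A, b, alphas):
    """Ridge coefficients for each alpha (hypothetical helper, one row per alpha).

    Assumes the objective ||A x - b||^2 + alpha * ||x||^2 with alpha > 0.
    """
    # Thin SVD: A = U @ diag(s) @ Vt, computed once and reused for every alpha.
    U, s, Vt = linalg.svd(A, full_matrices=False)
    # Projection of b onto the left singular vectors, also shared by all alphas.
    Utb = U.T @ b
    alphas = np.asarray(alphas, dtype=float)
    # Shrinkage factors s / (s^2 + alpha), shape (n_alphas, n_singular_values).
    d = s / (s ** 2 + alphas[:, np.newaxis])
    # Row i is V @ (d[i] * U^T b), the ridge solution for alphas[i].
    return (d * Utb) @ Vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 5))
    b = rng.standard_normal(20)
    print(ridge_path(A, b, [0.1, 1.0, 10.0]).shape)
```

The cost of the decomposition is paid once, so adding more values of alpha only adds a cheap rescaling and one small matrix product per value.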

I haven't worked in the manifold module since last time, yet thanks to Jake VanderPlas there are some cool features I can talk about. First off, the ARPACK backend is finally working and gives factor one speedup over the lobpcg + PyAMG approach. The key...

Hey there, As we all know, there is more and more available data and more and more efficient ways to analyze them to get useful answers, about pretty much everything. So, statisticians and “data scientists” in general are usually busy people. However, if some of them ever get bored, there seem to be nice associations [...]

John von Neumann famously said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” By this he meant that one should not be impressed when a complex model fits a data set well. With enough parameters, you can fit any data set. It turns out you can literally fit [...]

Preparing a dataset for analysis is an arduous process. Besides recoding and cleaning variables, a diligent data analyst also must assign variable labels and value labels, unless they choose to wait until after their output is exported to Microsoft Word. Unfortunately, that option only leaves additional opportunity for error and confusion, not to mention the inefficiency of editing tables in Microsoft Word. Who among us has not been frustrated while…

When Samuel Hansen said in his interview “You’re not a pure mathematician” I agreed without thinking, but later the statement bothered me a little. I know what he meant: considering the two categories of pure math and applied math, you’d put yourself in the latter category. Which is true. But the term “pure” math can be [...]

One advantage of crude models is that we know they are crude and will not try to read too much from them. With more sophisticated models, … there is an awful temptation to squeeze the lemon until it is dry and to present a picture of the future which through its very precision and verisimilitude carries [...]

On Tuesday, May 24 at 1:00pm Eastern Daylight Time, I will be presenting a webcast on behalf of JMP, a visual data exploration and mining tool. The main theme of the talk is that companies tend to manage to metrics, so it is very important ...

Hello Readers, As some of you will already have heard, I have accepted the position of Business Intelligence Director at TripAdvisor for Business--the part of TripAdvisor that sells products and services to businesses rather than consumers. The largest p...

While I was looking up the Tukey quote in my earlier post, I ran across another of his quotes: “The test of a good procedure is how well it works, not how well it is understood.” At some level, it’s hard to argue against this. Statistical procedures operate on empirical data, so it makes sense that the procedures [...]

Here’s a recent discussion from Math Overflow. Q: I have some data points and, when I plot them on R, it looks like a normal distribution. I want to know how well my data fits the normal distribution. What kind of test should I do? A: There’s actually a much broader question that you should [...]

I've been working lately on improving the low-level API of the libsvm bindings in scikit-learn. The goal is to provide an API that encourages an efficient use of these libraries for expert users. These are methods that have lower overhead than the obje...