## Ridge regression path

July 12, 2011

Ridge coefficients for multiple values of the regularization parameter can be elegantly computed by updating the thin SVD of the design matrix: import numpy as np from scipy import linalg def ridge(A, b, alphas): """ ...
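As a rough sketch of the idea (not the code from the original post, which is truncated above): with the thin SVD A = U diag(s) V^T, the ridge solution for a given alpha is V diag(s / (s^2 + alpha)) U^T b, so a single factorization can be reused for every alpha. The name `ridge_path`, the return layout, and the random test data are assumptions made for illustration.

```python
import numpy as np
from scipy import linalg


def ridge_path(A, b, alphas):
    """Ridge coefficients for each alpha, reusing one thin SVD of A.

    Hypothetical helper: returns an array of shape (len(alphas), n_features).
    """
    # Thin SVD: U is (n_samples, k), s is (k,), Vt is (k, n_features)
    U, s, Vt = linalg.svd(A, full_matrices=False)
    Utb = U.T @ b  # project the target onto the left singular vectors once
    alphas = np.asarray(alphas, dtype=float)
    # Shrinkage factors s_i / (s_i^2 + alpha), one row per alpha
    d = s / (s ** 2 + alphas[:, np.newaxis])
    # Row j is V diag(d_j) U^T b, the ridge solution for alphas[j]
    return (d * Utb) @ Vt


# Made-up data, only to exercise the function
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
b = rng.standard_normal(50)
alphas = [0.1, 1.0, 10.0]
coefs = ridge_path(A, b, alphas)
```

The SVD is the expensive step; each additional alpha costs only an elementwise rescaling and one small matrix product.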

## LLE comes in different flavours

June 30, 2011

I haven't worked in the manifold module since last time, yet thanks to Jake VanderPlas there are some cool features I can talk about. First off, the ARPACK backend is finally working and gives a factor one speedup over the lobpcg + PyAMG approach. The key...

## Non-profit data science associations

June 28, 2011

Hey there, as we all know, there is more and more available data and more and more efficient ways to analyze them to get useful answers, about pretty much everything. So, statisticians and “data scientists” in general are usually busy people. However, if some of them ever get bored, there seem to be nice associations [...]

## Installing Multiple Version of R in parallel on the same machine – Mac OS X

June 24, 2011

In a few days I'm going to attend a Bioconductor course; I was asked to install on my MacBook (Mac OS X 10.5.8) a developer version of R (plus ad hoc Bioconductor packages). In order to keep my old R installation (2.13) alongside the ne...

## How to fit an elephant

June 21, 2011

John von Neumann famously said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” By this he meant that one should not be impressed when a complex model fits a data set well. With enough parameters, you can fit any data set. It turns out you can literally fit [...]
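The claim that enough parameters will fit any data set can be illustrated with a small sketch. It assumes nothing beyond NumPy: a polynomial with as many coefficients as there are data points passes through all of them, even when the "data" is pure noise. The sample size and the random values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
x = np.linspace(0.0, 1.0, n)
y = rng.standard_normal(n)  # arbitrary "data" with no structure at all

# A degree n-1 polynomial has n coefficients: one parameter per data point
V = np.vander(x, n)             # square Vandermonde matrix
coeffs = np.linalg.solve(V, y)  # solve for the interpolating polynomial
fitted = V @ coeffs

# The fit is exact up to rounding error, yet says nothing about the data
max_residual = np.max(np.abs(fitted - y))
```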

## Numerical variables profiling in very large data set

June 21, 2011

Profiling numerical variables is an integral part of data analytics, which generally consists of obtaining standard descriptive statistics such as quantiles, first central moments as well as the missing ratio. It is easily obtainable by using PROC MEANS (or...
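To make the statistics named above concrete, here is a minimal in-memory sketch in NumPy rather than the SAS procedure the post uses. The function name `profile_numeric`, the use of NaN to mark missing values, and the population (divide by n) convention for the moments are all assumptions, and the sketch does not address the very large data case the post is about.

```python
import numpy as np


def profile_numeric(x, quantiles=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Descriptive statistics for one numeric column (hypothetical helper).

    NaN is treated as a missing value.
    """
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    valid = x[~missing]
    mean = valid.mean()
    centered = valid - mean
    return {
        "n": int(valid.size),
        "missing_ratio": float(missing.mean()),
        "mean": float(mean),
        "variance": float((centered ** 2).mean()),      # second central moment
        "third_moment": float((centered ** 3).mean()),  # third central moment
        "quantiles": dict(zip(quantiles, np.quantile(valid, quantiles))),
    }


stats = profile_numeric([1.0, 2.0, np.nan, 4.0, 5.0])
```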

## ChildFreq: a Tool to Explore Word Frequencies in Child Language.

June 20, 2011

Have you ever wondered if children prefer bananas over candy or when their fascination for dinosaurs kicks in? These are the kinds of questions you can get answered on my new webpage ChildFreq. Using a huge child language database, ChildFreq shows you ...

## Using Syntax to Assign ‘Variable Labels’ and ‘Value Labels’ in SPSS

June 20, 2011

Preparing a dataset for analysis is an arduous process. Besides recoding and cleaning variables, a diligent data analyst also must assign variable labels and value labels, unless they choose to wait until after their output is exported to Microsoft Word. Unfortunately, that option only leaves additional opportunity for error and confusion, not to mention the inefficiency of editing tables in Microsoft Word. Who among us has not been frustrated while…

## Impure math

June 15, 2011

When Samuel Hansen said in his interview, “You’re not a pure mathematician,” I agreed without thinking, but later the statement bothered me a little. I know what he meant: considering the two categories of pure math and applied math, you’d put yourself in the latter category. Which is true. But the term “pure” math can be [...]

## BioStatMatt, PhD

June 12, 2011

Also in this picture: Mary Shotwell, PhD (right) and our mentor Elizabeth Slate, PhD (left)

## %HPGLIMMIX macro on large scale HMM

June 7, 2011

* Update: The SAS code and associated paper are now published in the Journal of Statistical Software. PROC GLIMMIX is a good tool for generalized linear mixed models (GLMM), when the scal...

## Manifold learning in scikit-learn

June 7, 2011

The manifold module in scikit-learn is slowly progressing: the locally linear embedding implementation was finally merged along with some documentation. At about the same time but in a different timezone, Jake VanderPlas began coding other manifold lea...

## Comparing HoltWinters() and ets()

I received this email today: I have a question about the ets() function in R, which I am trying to use for Holt-Winters exponential smoothing. My problem is that I am getting very different estimates of the alpha, beta and gamma parameters using ets()...

May 25, 2011

One advantage of crude models is that we know they are crude and will not try to read too much from them. With more sophisticated models, … there is an awful temptation to squeeze the lemon until it is dry and to present a picture of the future which through its very precision and verisimilitude carries [...]

## Sweave and pgfSweave in LyX 2.0.x (experimental)

May 25, 2011

Please ignore this post completely, because Sweave support has become mature in LyX since 2.0.2, and I no longer plan to add the pgfSweave module in LyX. For pgfSweave users, you may consider the new knitr module (available since 2.0.3) which uses the ...

## JMP Webcast:: Measuring What Matters

May 22, 2011

On Tuesday, May 24 at 1:00pm Eastern Daylight Time, I will be presenting a webcast on behalf of JMP, a visual data exploration and mining tool. The main theme of the talk is that companies tend to manage to metrics, so it is very important ...

## Works well versus well understood

May 10, 2011

While I was looking up the Tukey quote in my earlier post, I ran across another of his quotes: “The test of a good procedure is how well it works, not how well it is understood.” At some level, it’s hard to argue against this. Statistical procedures operate on empirical data, so it makes sense that the procedures [...]

## Move on to the next question

May 9, 2011

Here’s a recent discussion from Math Overflow. Q: I have some data points and, when I plot them on R, it looks like a normal distribution. I want to know how well my data fits the normal distribution. What kind of test should I do? A: There’s actually a much broader question that you should [...]

## Handwritten digits and Locally Linear Embedding

May 4, 2011

I decided to test my new Locally Linear Embedding (LLE) implementation against a real dataset. At first I didn't think this would turn out very well, since LLE seems to be somewhat fragile, yielding largely different results for small differences in pa...
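For readers who want to try a comparable experiment, the sketch below embeds a subset of the handwritten digits data set with the `LocallyLinearEmbedding` class from current scikit-learn. This is an assumption-laden illustration, not the 2011 implementation discussed in the post: the choice of six digit classes, `n_neighbors=30`, and the fixed `random_state` are arbitrary, and given the sensitivity mentioned above, other settings may give a noticeably different picture.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding

# 8x8 grayscale images of the digits 0-5, flattened to 64 features each
digits = load_digits(n_class=6)
X = digits.data

lle = LocallyLinearEmbedding(
    n_neighbors=30,     # illustrative value; results depend strongly on it
    n_components=2,     # embed into the plane for plotting
    method="standard",  # plain LLE rather than one of its variants
    random_state=0,
)
embedding = lle.fit_transform(X)  # one 2-D point per image
```

Plotting `embedding` colored by `digits.target` shows how well the classes separate in the embedded space.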

## Produce Authentic Math Formulas in R Graphics

April 30, 2011

I remember a few weeks ago, there was a challenge in the R-help list to make the prime symbol in R graphics. In LaTeX, we simply write $X'$ or $X^\prime$. R has rough support for math expressions (see demo(plotmath)) and they are certainly unsatisfac...

## Low-level routines for Support Vector Machines

April 27, 2011

I've been working lately on improving the low-level API of the libsvm bindings in scikit-learn. The goal is to provide an API that encourages an efficient use of these libraries for expert users. These are methods that have lower overhead than the obje...