Regularized Discriminant Analysis

April 10, 2011

Demo SAS implementation of Regularized (Linear) Discriminant Analysis of J. Friedman (1989) [1]. A simpler introduction can be found at [2]. Regularized QDA follows similarly. To save coding, I called R within SAS to finish the computation. For details to ...

New versions of GGobi and rggobi for Windows users

April 9, 2011

For those who have been struggling with the installation of GGobi and the rggobi package under Windows: a major update of GGobi 2.1.9 is that GTK+ has been bundled with GGobi, so the installation of GTK+ is no longer required (I recommend you to uninst...

A profiler for Python extensions

April 6, 2011

Profiling Python extensions has not been a pleasant experience for me, so I made my own package to do the job. Existing alternatives were either hard to use, forcing you to recompile with custom flags like gprofile or desperately slow like valgrind/cal...

scikit-learn coding sprint in Paris

April 2, 2011

Yesterday was the scikit-learn coding sprint in Paris. It was great to meet with old developers (Vincent Michel) and new ones: some of whom I was already familiar with from the mailing list while others came just to say hi and get familiar with the cod...

Data Mining Techniques 3rd Edition

April 1, 2011

Gordon and I spent much of the last year writing the third edition of Data Mining Techniques and now, at last, I am holding the finished product in my hand. In the 14 years since the first edition came out, our knowledge has increased by a factor of at...

ElasticNet in SAS

April 1, 2011

Try out Elastic Net [1] in normal linear regression, using the naive algorithm. Exploring possibilities for GLM Elastic Net in SAS. 1. Zou, H and Hastie, T (2005). Regularization and Variable Selection via the Elastic Net, Journal of the Royal Statistica...

I’m switching to TeXstudio

I’ve happily used WinEdt for all my LaTeX editing for about 15 years and I’ve encouraged my whole research team to use it. But I’m tired of problems with WinEdt that take up my time. I regularly have requests for help from one of my r...

py3k in scikit-learn

March 28, 2011

One thing I'd really like to see done in this Friday's scikit-learn sprint is to have full support for Python 3. There's a branch where the hard work has been done (porting C extensions, automatic 2to3 conversion, etc.), although joblib still has some b...

Looking after your supervisor

Some good advice here: The care and maintenance of your adviser.

How to calculate R-squared for a decision tree model

March 22, 2011

A client recently wrote to us saying that she liked decision tree models, but for a model to be used at her bank, the risk compliance group required an R-squared value for the model and her decision tree software doesn't supply one. How should she fill...

Update on the parallel IMH article

March 16, 2011

Hey, last October I blogged about an article written by Christian P. Robert, Murray Smith and myself about parallel computation, Independent Metropolis-Hastings and Rao-Blackwellization. The article advocates the use of parallel computation, and the method described, called “block IMH”, can be run fully in parallel, which makes the whole thing pretty much costless compared to [...]

Function for Scaled Difference Tests (chi-square and loglikelihood values)

March 16, 2011

# For use when comparing the fit of nested models with complex data
# (e.g. TYPE = COMPLEX in Mplus).
# The Scaled Difference Chi-square Test Statistic can be found at
# http://preprints.stat.ucla.edu/260/chisquare.pdf
# This function provides scaled difference tests based on …

Ten rules for data analysis

Peter Kennedy was an associate editor of the International Journal of Forecasting and a superb applied econometrician. He died unexpectedly in August 2010. He was best known for his excellent book _A Guide to Econometrics_ as well as his “Ten Com...

Statistical tests for variable selection

I received an email today with the following comment: I’m using ARIMA with Intervention detection and was planning to use your package to identify my initial ARIMA model for later iteration, however I found that sometimes the auto.arima function ret...

Upcoming talks and classes

March 11, 2011

Michael will be doing a fair amount of teaching and presenting over the next several weeks: March 16-18 Data Mining Techniques Theory and Practice at SAS Institute in Chicago. March 29 Applying Survival Analysis to Forecasting Subscriber Levels at the Ne...

An enhanced Kaplan-Meier plot

March 8, 2011

We often see, in publications, a Kaplan-Meier survival plot, with a table of the number of subjects at risk at different time points aligned below the figure. I needed this type of plot (or really, matrices of such plots) for an upcoming publication. Of course, my preferred toolbox was R and the ggplot2 package. There […]

Cluster Silhouettes

March 4, 2011

The book is done! All 822 pages of the third edition of Data Mining Techniques for Marketing, Sales, and Customer Relationship Management will be hitting bookstore shelves later this month or you can order it now. To celebrate, I am returning to the bl...

RStudio: just what I’ve been looking for

For many years I used RWinEdt as my text editor for R code, but when WinEdt 6.0 came out, RWinEdt stopped working. So I’ve been looking for something to replace it. I’ve tried Tinn-R, NppToR, Eclipse with StatET and a couple of other editor...

SMC^2 algorithm for state-space models

February 28, 2011

Hi again, with Nicolas Chopin and Omiros Papaspiliopoulos, we’ve just submitted a paper called SMC^2: a sequential Monte Carlo algorithm with particle Markov chain Monte Carlo updates. You can read the article online on arXiv here. This algorithm makes it possible to estimate the parameters of a state-space model without any assumption of linearity or Gaussianity. [...]

Collider Example

February 24, 2011

library(MASS)
covar <- mvrnorm(250, c(0, 0), matrix(c(1, 0.00, 0.00, 1), 2, 2))
mydata <- data.frame(covar)
names(mydata) <- c("sat", "mot")
mydata$admin <- mydata$sat + mydata$mot
mydata$admin2 <- ifelse(mydata$admin >= quantile(mydata$admin, .85), "pass", "fail")
library(ggplot2)
qplot(sat, mot, data = mydata, color = admin2)

The real data mining battle: Watson vs Google

February 17, 2011

Again, IBM showed us our limits, this time with the help of some fine data mining. But is Watson the only reasonable opponent of the human brain out there? How about Google? This article spotlights a battle between Watson and [...]

Computing the vector norm

February 15, 2011

Update: a fast and stable norm was added to scipy.linalg in August 2011 and will be available in scipy 0.10. Last week I discussed with Gael how we should compute the Euclidean norm of a vector a using SciPy. Two approaches suggest themselves, either ca...

Smells like hacker spirit

February 11, 2011

Last weekend I was at FOSDEM presenting scikits.learn (here are the slides I used at the Data Analytics Devroom). Kudos to Olivier Grisel and all the people who organized such a fun and authentic meeting!

