Blog Archives

Be careful evaluating model predictions

December 3, 2016

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score. This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter …
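The re-scaling point in the excerpt above can be illustrated with a minimal sketch. The data is made up for illustration, and root mean squared error (RMSE) is assumed here as a stand-in for "a score that judges the result in your hand"; the excerpt does not say which score the full article recommends.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rmse(pred, actual):
    """Root mean squared error of predictions against actual values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / n)

# Hypothetical outcomes and a reasonable prediction of them.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
good = [1.1, 1.9, 3.2, 3.8, 5.0]

# The same prediction after an (assumed) scale and shift error.
rescaled = [10.0 * p + 3.0 for p in good]

cor_good = pearson(good, actual)
cor_rescaled = pearson(rescaled, actual)
rmse_good = rmse(good, actual)
rmse_rescaled = rmse(rescaled, actual)

# Correlation is unchanged by the positive linear re-scaling;
# RMSE is much worse for the re-scaled prediction.
print("correlation:", cor_good, cor_rescaled)
print("RMSE:", rmse_good, rmse_rescaled)
```

Correlation gives the two predictions the same score, so it cannot tell the usable prediction from the mis-scaled one, while RMSE separates them clearly.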

vtreat data cleaning and preparation article now available on arXiv

November 30, 2016

Nina Zumel and I are happy to announce that a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP]. vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer …

Teaching Practical Data Science with R

November 16, 2016

Practical Data Science with R (Zumel, Mount; Manning 2014) is a book Nina Zumel and I are very proud of. I have written before how I think this book stands out and why you should consider studying from it. Please read on for some additional comments on the intent of different sections of the …

You should re-encode high cardinality categorical variables

November 11, 2016

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes. In a sort …

Laplace noising versus simulated out of sample methods (cross frames)

November 9, 2016

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method inspired by differential privacy methods, …

Some vtreat design principles

November 1, 2016

We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design. Introduction vtreat …

A quick look at RStudio’s R notebooks

October 22, 2016

A quick demo of RStudio’s R Notebooks shown by John Mount (of Win-Vector LLC, a statistics, data science, and algorithms consulting and training firm). It looks like some of the new in-line display behavior is back-ported to R Markdown and some of the difference is the delayed running and different level of interactivity in …

Data science for executives and managers

October 22, 2016

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made up of practitioners (who we hope …

The unfortunate one-sided logic of empirical hypothesis testing

October 17, 2016

I’ve been thinking a bit on statistical tests, their absence, abuse, and limits. I think much of the current “scientific replication crisis” stems from the fallacy that “failing to fail” is the same as success (in addition to the forces of bad luck, limited research budgets, statistical naiveté, sloppiness, pride, greed and other human qualities …

On calculating AUC

October 7, 2016

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC). R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is instructive to see how ROC plots …

