Blog Archives

Relative error distributions, without the heavy tail theatrics

September 20, 2016

Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing effects of relative error distributions (so it isn’t an exotic …

Adversarial machine learning

September 11, 2016

I just got back from a very good conference organized by startup.ml: Adversarial Machine Learning. Please read on for my comments on part of one of the very good talks. Classic machine learning (especially as it is taught in classes) emphasizes a nice safe static environment where you are given some unchanging data and …

Did she know we were writing a book?

September 3, 2016

Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge. Nina Zumel and I definitely troubled over possibilities for …

Variables can synergize, even in a linear model

September 1, 2016

Introduction: Suppose we have the task of predicting an outcome y given a number of variables v1, ..., vk. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve explainability, even reduce over-fit, and improve …

The R community is awesome (and fast)

August 30, 2016

Recently I whined/whinged or generally complained about a few sharp edges in some powerful R systems. In each case I was treated very politely, listened to, and actually got fixes back in a very short timeframe from volunteers. That is really great and probably one of the many reasons R is a great ecosystem. Please …

Variable pruning is NP hard

August 28, 2016

I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing some examples is that summaries such as model quality (especially out of sample quality) and variable significances are not quite as simple as one would hope (they in …

vtreat 0.5.27 released on CRAN

August 19, 2016

Win-Vector LLC, Nina Zumel and I are pleased to announce that ‘vtreat’ version 0.5.27 has been released on CRAN. vtreat is a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. (from the package documentation) Very roughly vtreat accepts an arbitrary “from the wild” data frame (with different column types, …

My criticism of R numeric summary

August 18, 2016

My criticism of R’s numeric summary() method is: it is unfaithful to numeric arguments (due to bad default behavior) and frankly it should be considered unreliable. It is likely the way it is for historic and compatibility reasons, but in my opinion it does not currently represent a desirable set of tradeoffs. summary() likely represents …

The Win-Vector parallel computing in R series

August 16, 2016

With our recent publication of “Can you nest parallel operations in R?” we now have a nice series of “how to speed up statistical computations in R” that moves from application, to larger/cloud application, and then to details. For your convenience here they are in order: A gentle introduction to parallel computing in R Running …

On accuracy

July 22, 2016

In our last article on the algebra of classifier measures we encouraged readers to work through Nina Zumel’s original “Statistics to English Translation” series. This series has become slightly harder to find as we have used the original category designation “statistics to English translation” for additional work. To make things easier here are links to …

