Blog Archives

Wanted: A Perfect Scatterplot (with Marginals)

June 12, 2015

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki. The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot”; it’s called a “scatterhist” in Ma...

Does Balancing Classes Improve Classifier Performance?

February 27, 2015

It’s a folk theorem I sometimes hear from colleagues and clients: that you must balance the class prevalence before training a classifier. Certainly, I believe that classification tends to be easier when the classes are nearly balanced, especially when the class you are actually interested in is the rarer one. But I have always been …

Random Test/Train Split is not Always Enough

January 5, 2015

Most data science projects are well served by a random test/train split. In our book Practical Data Science with R we strongly advise preparing data and including enough variables so that data is exchangeable, and scoring classifiers using a random test/train split. With enough data and a big enough arsenal of methods, it’s relatively easy …

The Geometry of Classifiers

December 19, 2014

As John mentioned in his last post, we have been quite interested in the recent study by Fernandez-Delgado et al., “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” (the “DWN study” for short), which evaluated 179 popular implementations of common classification algorithms over 120 or so data sets, mostly from the UCI …

Estimating Generalization Error with the PRESS statistic

September 25, 2014

As we’ve mentioned on previous occasions, one of the defining characteristics of data science is the emphasis on the availability of “large” data sets, which we define as “enough data that statistical efficiency is not a concern” (note that a “large” data set need not be “big data,” however you choose to define it). In …

Vtreat: designing a package for variable treatment

August 8, 2014

When you apply machine learning algorithms on a regular basis, on a wide variety of data sets, you find that certain data issues come up again and again: missing values (NA or blanks); problematic numerical values (Inf, NaN, sentinel values like 999999999 or -1); valid categorical levels that don’t appear in the training data (especially …

Trimming the Fat from glm() Models in R

May 30, 2014

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to […]

Bandit Formulations for A/B Tests: Some Intuition

April 24, 2014

“Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior.” (Kohavi, Henne, Sommerfeld, “Practical Guide to Controlled Experiments on the Web,” 2007)

A/B tests are one of the simplest ways of running controlled experiments to evaluate the efficacy of a proposed improvement (a new […]

Practical Data Science with R: Release date announced

March 26, 2014

It took a little longer than we’d hoped, but we did it! Practical Data Science with R will be released on April 2nd (physical version). The eBook version will follow soon after, on April 15th. You can preorder the pBook now on the Manning book page. The physical version comes with a complimentary eBook version […]

The Statistics behind “Verification by Multiplicity”

March 2, 2014

There’s a new post up at the ninazumel.com blog that looks at the statistics of “verification by multiplicity”: the statistical technique behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope. We normally don’t write about science here at Win-Vector, but we do […]
