Blog Archives

Old tails: a crude power law fit on ebook sales

April 18, 2014

We use R to take a very brief look at the distribution of e-book sales on Amazon.com. Recently Hugh Howey shared some e-book sales data spidered from Amazon.com: The 50k Report. The data is largely a single scrape of statistics about various anonymized books. Howey’s analysis tries to break sales down by declared category and […]


Can a classifier that never says “yes” be useful?

March 8, 2014

Many data science projects and presentations are needlessly derailed because shared, business-relevant quantitative expectations were not set early on (for some advice see Setting expectations in data science projects). One of the most common issues is the layman's expectation of “perfect prediction” from classification projects. It is important to set expectations correctly so […]


Some statistics about the book

March 4, 2014

The release date for Zumel and Mount's “Practical Data Science with R” is getting close. I thought I would share a few statistics about what goes into this kind of book. Formal work on “Practical Data Science with R” started in October of 2012. We had always felt the Win-Vector blog represented practice and research for such […]


Drowning in insignificance

February 26, 2014

Some researchers (in both science and marketing) abuse a slavish view of p-values to try to falsely claim credibility. The incantation is: “we achieved p = x (with x ≤ 0.05) so you should trust our work.” This might be true if the published result had been performed as a single project (and not as […]
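The excerpt above is truncated, so the following is only a hedged illustration of one standard reason a lone “p ≤ 0.05” can mislead, not a reconstruction of the post's argument. It assumes a hypothetical setting in which several independent experiments are run on effects that are truly absent, and computes the chance that at least one of them clears the 0.05 threshold by luck alone. The function name and the choice of 20 experiments are illustrative assumptions.

```python
def chance_of_false_positive(num_experiments, alpha=0.05):
    """Probability that at least one of `num_experiments` independent tests
    of a true null hypothesis yields p <= alpha purely by chance."""
    return 1.0 - (1.0 - alpha) ** num_experiments


if __name__ == "__main__":
    # Hypothetical experiment counts, chosen only for illustration.
    for k in (1, 5, 20):
        print(k, round(chance_of_false_positive(k), 4))
```

Under these assumptions a single pre-planned experiment carries the advertised 5% false-positive risk, while reporting the best of 20 attempts carries a risk of roughly 64%.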


One day discount on Practical Data Science with R

February 21, 2014

Please forward and share this discount offer for our upcoming book. Manning Deal of the Day February 22: Half off Practical Data Science with R. Use code dotd022214au at www.manning.com/zumel/.


The gap between data mining and predictive models

February 21, 2014

The Facebook data science blog shared some fun data explorations this Valentine’s Day in Carlos Greg Diuk’s “The Formation of Love”. They are rightly receiving positive interest in and positive reviews of their work (for example Robinson Meyer’s Atlantic article). The finding is also a great opportunity to discuss the gap between cool data mining […]


Unprincipled Component Analysis

February 10, 2014

As a data scientist I have seen variations of principal component analysis and factor analysis blindly misapplied and abused so often that I have come to think of the technique as unprincipled component analysis. PCA is a good technique often used to reduce sensitivity to overfitting. But this stated design intent leads many to (falsely) […]


Bad Bayes: an example of why you need hold-out testing

February 1, 2014

We demonstrate a dataset that causes many good machine learning algorithms to overfit horribly. The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams […]


Use standard deviation (not mad about MAD)

January 19, 2014

Nassim Nicholas Taleb recently wrote an article advocating abandoning standard deviation in favor of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure, but there is a reason that standard deviation is important even if you do not like it: it prefers models that […]
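For readers who want the two measures side by side, here is a minimal sketch that computes both on a small made-up sample. The function names and the sample values are illustrative assumptions, and the population (divide by n) form of the standard deviation is used for simplicity. It does not attempt to reproduce the post's argument, which is truncated above.

```python
import math


def mean_absolute_deviation(xs):
    """Average absolute distance of each value from the mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def standard_deviation(xs):
    """Population standard deviation: root of the mean squared distance from the mean."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


if __name__ == "__main__":
    sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values for illustration
    print("mean absolute deviation:", mean_absolute_deviation(sample))
    print("standard deviation:", standard_deviation(sample))
```

Because it squares the distances, the standard deviation gives more weight to values far from the mean, so it is never smaller than the mean absolute deviation on the same data.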


Generalized linear models for predicting rates

January 1, 2014

I often need to build a predictive model that estimates rates. The example of our age is ad click-through rates (how often a viewer clicks on an ad, estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You […]


