Blog Archives

Big News! “Practical Data Science with R” MEAP launched!

May 15, 2013

Nina Zumel and I (John Mount) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered the Manning Early Access Program (MEAP), which allows you to subscribe to chapters as they become available and give us feedback before the book goes into [...]

Related posts: Setting expectations in data science projects | Data Science, Machine Learning, and Statistics:…

A pathological glm() problem that doesn’t issue a warning

May 1, 2013

I know I have already written a lot about technicalities in logistic regression (see for example: “How robust is logistic regression?” and “Newton-Raphson can compute an average”). But I just ran into a simple case where R’s glm() implementation of logistic regression seems to fail without issuing a warning message. Yes, the data is a [...]

Related posts: Newton-Raphson can compute an average | How robust is logistic regression? | What does…

Data Science, Machine Learning, and Statistics: what is in a name?

April 19, 2013

A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the [...]

Related posts: A Personal Perspective on Machine Learning | Setting expectations in data science projects…

Checking claims in published statistics papers

April 8, 2013

When finishing “Worry about correctness and repeatability, not p-values” I got to thinking a bit more about what you can actually check when reading a paper, especially when you don’t have access to the raw data. Some of the fellow scientists I admire most have a knack for back-of-the-envelope calculations and dimensional [...]

Related posts: Worry about correctness and repeatability, not p-values | Statistics to English Translation, Part…

Worry about correctness and repeatability, not p-values

April 5, 2013

In data science work you often run into cryptic sentences like the following: Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P < 0.01 for linear [...]

Related posts: Level fit summaries can be tricky in R | How to test XCOM…

A bit more on sample size

March 8, 2013

In our article “What is a large enough random sample?” we pointed out that if you wanted to measure a proportion to an accuracy “a” with a chance of being wrong of “d” then an idea was to guarantee you had a sample size of at least: This is the central question in designing opinion polls [...]

Related posts: What is a large enough random sample? | Level fit summaries can be…
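The sample-size expression itself did not survive in this excerpt. As a hedged illustration only, one standard bound for this kind of question is Hoeffding's inequality, which says a sample of size n >= ln(2/d) / (2 a^2) is enough for an observed proportion to land within “a” of the true proportion with failure probability at most “d”. The function name below is hypothetical, and this may not be the exact expression the article used.

```python
import math


def hoeffding_sample_size(a: float, d: float) -> int:
    """Smallest n satisfying 2 * exp(-2 * n * a**2) <= d (Hoeffding bound).

    a: desired accuracy (absolute error in the estimated proportion).
    d: allowed chance that the estimate is off by more than a.
    Assumption: this is a textbook bound used for illustration, not
    necessarily the formula from the original article.
    """
    if not (0.0 < a < 1.0) or not (0.0 < d < 1.0):
        raise ValueError("a and d must both lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / d) / (2.0 * a * a))


if __name__ == "__main__":
    # A poll accurate to +/- 5 points with at most a 5% chance of being wrong.
    print(hoeffding_sample_size(0.05, 0.05))
```

The bound is conservative because it ignores the true proportion; tighter estimates are possible when the proportion is known to be far from 1/2.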

Don’t use correlation to track prediction performance

February 22, 2013

Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope that after reading this you feel at least a small urge to double-check your work and presentations to make sure you have not reported correlation where [...]

Related posts: Correlation and R-Squared | Level fit summaries can be tricky in R | Modeling…

More on ROC/AUC

January 18, 2013

A bit more on the ROC/AUC. The receiver operating characteristic curve (or ROC) is one of the standard methods to evaluate a scoring system. Nina Zumel has described its application, but we would like to emphasize some additional details. In my opinion, while the ROC is a useful tool, the “area under the curve” [...]

Related posts: “I don’t think that means what you think it means;” Statistics to…

How to test XCOM “dice rolls” for fairness

December 11, 2012

XCOM: Enemy Unknown is a turn-based video game where the player chooses among actions (for example shooting an alien) that are labeled with a declared probability of success. A lot of gamers, after missing a shot with an 80% chance of success, start asking if the game’s pseudorandom number generator is [...]

Related posts: Statistics to English Translation, Part 2a: ‘Significant’ Doesn’t Always Mean ‘Important’ | Statistics…
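As a rough sketch of how one might start checking such a claim (not necessarily the method the full post uses), the exact binomial tail gives the probability of seeing a given number of hits or fewer if the declared probability is honest and shots are independent. The function name and the example counts are hypothetical.

```python
from math import comb


def prob_at_most(hits: int, shots: int, p: float) -> float:
    """P(X <= hits) for X ~ Binomial(shots, p).

    hits:  number of successful shots observed.
    shots: number of shots taken, all labeled with the same probability p.
    p:     the success probability the game declares for each shot.
    Assumes shots are independent, which a real game may not guarantee.
    """
    if not 0 <= hits <= shots:
        raise ValueError("hits must be between 0 and shots")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return sum(
        comb(shots, k) * p**k * (1.0 - p) ** (shots - k)
        for k in range(hits + 1)
    )


if __name__ == "__main__":
    # Missing a single 80% shot is unremarkable: it happens 1 time in 5.
    print(prob_at_most(0, 1, 0.8))
    # Hitting only 5 of 10 shots labeled 80% is much rarer under a fair generator.
    print(prob_at_most(5, 10, 0.8))
```

A single miss tells you almost nothing; only a long, recorded run of shots at a known declared probability gives the tail probability enough weight to raise suspicion.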

Please stop using Excel-like formats to exchange data

December 7, 2012

I know “officially” data scientists always work in “big data” environments with data in a remote database, streaming store, or key-value system. But in day-to-day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain. I would like to make a plea to my [...]

Related posts: Large Data Logistic Regression (with example Hadoop code) | Added worked example to…
