Posts Tagged ‘Tutorials’

Be careful evaluating model predictions

December 3, 2016

One thing I teach is this: when evaluating the performance of regression models, do not use correlation as your score. Correlation tells you whether a re-scaling of your result would be useful, but you want to know whether the result you have in hand is in fact useful. For example: the Mars Climate Orbiter …
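As a quick illustration of the issue (a minimal sketch with simulated data, not taken from the post itself): predictions that are a re-scaling of the truth have perfect correlation, yet can still be badly off in absolute terms.

```r
# Minimal sketch with simulated data (not from the post): a perfectly
# correlated but badly re-scaled "prediction".
set.seed(2016)
y <- rnorm(100, mean = 10, sd = 2)   # true outcomes
pred <- 5 * y + 3                    # predictions that are a re-scaling of y

cor(y, pred)              # 1: correlation reports a perfect fit
sqrt(mean((y - pred)^2))  # large RMSE: these predictions are not usable as-is
```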

Read more »

A quick look at RStudio’s R notebooks

October 22, 2016

A quick demo of RStudio’s R Notebooks, shown by John Mount (of Win-Vector LLC, a statistics, data science, and algorithms consulting and training firm). (link) It looks like some of the new in-line display behavior has been back-ported to R Markdown, and some of the difference is the delayed running and a different level of interactivity in …

Read more »

Data science for executives and managers

October 22, 2016

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made up of practitioners (who we hope …

Read more »

Upcoming Talks

October 17, 2016

I (Nina Zumel) will be speaking at the Women Who Code Silicon Valley meetup on Thursday, October 27. The talk is called Improving Prediction using Nested Models and Simulated Out-of-Sample Data. In this talk I will discuss nested predictive models: models that predict an outcome or dependent variable (called y) using additional submodels …
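A minimal sketch of the idea (hypothetical data and variable names, not material from the talk): a submodel’s prediction is used as an input variable to an outer model.

```r
# Sketch with hypothetical data (not from the talk): the outer model takes a
# submodel's prediction as one of its input variables.
set.seed(2016)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d$y <- d$x1 + 2 * d$x2 - d$x3 + rnorm(200)

submodel <- lm(y ~ x1 + x2, data = d)        # submodel on a subset of variables
d$subpred <- predict(submodel, newdata = d)  # its prediction becomes a new variable

# Naively fitting both models on the same data risks over-fit, which is where
# simulated out-of-sample data comes in.
outer_model <- lm(y ~ subpred + x3, data = d)
summary(outer_model)
```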

Read more »

The unfortunate one-sided logic of empirical hypothesis testing

October 17, 2016

I’ve been thinking a bit about statistical tests: their absence, abuse, and limits. I think much of the current “scientific replication crisis” stems from the fallacy that “failing to fail” is the same as success (in addition to the forces of bad luck, limited research budgets, statistical naiveté, sloppiness, pride, greed, and other human qualities …
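To make the “failing to fail” point concrete, here is a small simulated example (not from the post): a real effect that an under-powered test usually does not detect, so a non-significant result is not evidence of no effect.

```r
# Sketch (simulated, not from the post): a genuine but small difference that a
# small-sample t-test will usually fail to declare significant.
set.seed(2016)
groupA <- rnorm(10, mean = 0)
groupB <- rnorm(10, mean = 0.3)   # the true difference is real, just small
t.test(groupA, groupB)$p.value    # typically > 0.05: "failing to fail",
                                  # which is not the same as showing no effect
```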

Read more »

On calculating AUC

October 7, 2016

Recently, Microsoft data scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC). R has a number of ROC/AUC packages, for example ROCR, pROC, and plotROC, but it is instructive to see how ROC plots …
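As a small illustration (one way to compute it, not necessarily the construction used in the post; calcAUC is a made-up helper): AUC equals the probability that a randomly chosen positive example scores above a randomly chosen negative one, with ties counted as one half.

```r
# Sketch (calcAUC is a hypothetical helper, not from the post): AUC as the
# probability that a random positive outranks a random negative (ties = 1/2).
calcAUC <- function(scores, labels) {
  pos <- scores[labels]
  neg <- scores[!labels]
  diffs <- outer(pos, neg, FUN = "-")
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

set.seed(2016)
scores <- runif(100)
labels <- runif(100) < scores    # labels loosely follow the scores
calcAUC(scores, labels)
```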

Read more »

Adding polished significance summaries to papers using R

October 4, 2016

When we teach “R for statistics” to groups of scientists (who tend to be quite well informed about statistics, and just need a bit of help with R), we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms of a larger and more …
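One base-R way to produce such a summary (a rough sketch with made-up data, not necessarily the recipe from the lesson): extract the overall F-test from a linear model and format it in the usual reporting style.

```r
# Sketch with made-up data (not necessarily the lesson's recipe): format the
# F-test of a linear model for inclusion in a paper.
set.seed(2016)
d <- data.frame(x = 1:20)
d$y <- 0.5 * d$x + rnorm(20)
model <- lm(y ~ x, data = d)

s <- summary(model)
f <- s$fstatistic   # named vector: value, numdf, dendf
pval <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
sprintf("F(%d, %d) = %.2f, p = %.3g, R-squared = %.3f",
        f["numdf"], f["dendf"], f["value"], pval, s$r.squared)
```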

Read more »

Relative error distributions, without the heavy tail theatrics

September 20, 2016

Nina Zumel prepared an excellent article on the consequences of working with relative-error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing the effects of relative error distributions (so it isn’t an exotic …
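A quick simulated example (not from the article) of what a relative-error, or lognormal, distribution does to familiar summaries:

```r
# Sketch (simulated, not from the article): with lognormally distributed
# quantities the mean is pulled well above the median by the long right tail.
set.seed(2016)
income <- rlnorm(10000, meanlog = 10, sdlog = 1)

mean(income)                  # dominated by the tail
median(income)                # noticeably smaller than the mean
mean(income > mean(income))   # well under half the values exceed the mean
```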

Read more »

Variables can synergize, even in a linear model

September 1, 2016

Introduction: Suppose we have the task of predicting an outcome y given a number of variables v1, …, vk. We often want to “prune variables,” or build models with fewer than all the variables. This can be done to speed up modeling, decrease the cost of producing future data, improve robustness, improve explainability, even reduce over-fit, and improve …
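Here is a small simulated example of the effect (not the post’s own example): two variables that are each nearly useless alone, yet jointly determine the outcome exactly in a linear model.

```r
# Sketch (simulated, not the post's example): x1 and x2 are nearly useless on
# their own, but the outcome is exactly their difference.
set.seed(2016)
n <- 1000
z  <- rnorm(n)
x1 <- z + 0.1 * rnorm(n)
x2 <- z - 0.1 * rnorm(n)    # x1 and x2 are strongly correlated
y  <- x1 - x2               # the signal lives in the small difference

summary(lm(y ~ x1))$r.squared       # near 0: x1 alone explains little
summary(lm(y ~ x2))$r.squared       # near 0
summary(lm(y ~ x1 + x2))$r.squared  # 1: together they explain everything
```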

Read more »

Variable pruning is NP hard

August 28, 2016

I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing examples is that summaries such as model quality (especially out-of-sample quality) and variable significances are not quite as simple as one would hope (they in …
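To give a sense of the setting (illustrative simulated data only, not the article’s example): greedy step-wise selection with step() inspects only a tiny fraction of the 2^k possible variable subsets, which is the usual pragmatic response to exhaustive subset search being infeasible.

```r
# Sketch (simulated data, not the article's example): greedy backward step-wise
# selection examines only a tiny fraction of the 2^k possible variable subsets.
set.seed(2016)
n <- 200
d <- data.frame(matrix(rnorm(n * 5), nrow = n))
names(d) <- paste0("v", 1:5)
d$y <- d$v1 + d$v2 + rnorm(n)

full_model <- lm(y ~ ., data = d)
pruned <- step(full_model, direction = "backward", trace = 0)
summary(pruned)
```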

Read more »

