Posts Tagged ‘Pragmatic Data Science’

The case for index-free data manipulation

December 10, 2016

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit …

Using replyr::let to Parameterize dplyr Expressions

December 7, 2016

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want …

Be careful evaluating model predictions

December 3, 2016

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score. This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter …
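The point about re-scaling can be illustrated with a small sketch. This is not code from the article: it is written in plain Python, and the data and the helper names `pearson` and `rmse` are made up for illustration. It shows that correlation is unchanged when predictions are shifted and scaled, while an error measure such as RMSE is not.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def rmse(pred, actual):
    """Root mean squared error of the predictions as given (no re-scaling)."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

# Hypothetical outcomes and predictions (illustrative values only).
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
good_pred = [1.1, 1.9, 3.2, 3.8, 5.0]

# The same predictions after a positive scale and a shift.
bad_pred = [10 * p + 3 for p in good_pred]

print("correlation:", pearson(good_pred, actual), pearson(bad_pred, actual))
print("RMSE:       ", rmse(good_pred, actual), rmse(bad_pred, actual))
```

Both prediction vectors get the same correlation score, but only `good_pred` is close to the outcomes it is meant to predict, which is what the RMSE column reports.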

vtreat data cleaning and preparation article now available on arXiv

November 30, 2016

Nina Zumel and I are happy to announce a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP]. vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. It prepares variables so that data has fewer …

Teaching Practical Data Science with R

November 16, 2016

Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of. I have written before how I think this book stands out and why you should consider studying from it. Please read on for some additional comments on the intent of different sections of the …

You should re-encode high cardinality categorical variables

November 11, 2016

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes. In a sort …

Some vtreat design principles

November 1, 2016

We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles behind the package design. Introduction vtreat …

Data science for executives and managers

October 22, 2016

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is made of practitioners (who we hope …

On calculating AUC

October 7, 2016

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC). R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is instructive to see how ROC plots …
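As a rough illustration of what computing AUC involves, here is a minimal sketch in plain Python. It is not the article's code and not the method of any of the R packages named above; the function names and the toy data are assumptions made for this example. It computes AUC two ways: as the trapezoidal area under the ROC points, and as the fraction of positive/negative pairs that the score orders correctly, counting ties as half. Tied scores are handled as a single threshold step, which is one of the details that is easy to get wrong.

```python
def roc_points(scores, labels):
    """ROC points (false positive rate, true positive rate), sweeping the
    threshold from the highest score down. Tied scores share one step."""
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Consume every example tied at this score before emitting a point.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def auc_pairwise(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data with one tie between a positive and a negative example.
scores = [0.9, 0.8, 0.8, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

print(auc_trapezoid(roc_points(scores, labels)))
print(auc_pairwise(scores, labels))
```

On this toy data the two calculations agree, which is a useful sanity check when writing an AUC routine by hand.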

Adding polished significance summaries to papers using R

October 4, 2016

When we teach “R for statistics” to groups of scientists (who tend to be quite well informed in statistics, and just need a bit of help with R) we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms of a larger and more …
