Nina Zumel finished some great new documentation showing how to use Python vtreat to prepare data for multinomial classification mode. And I have finally finished porting the documentation to R vtreat. So we now have good introductions on how to use vtreat to prepare data for the common tasks of: Regression: R regression example, Python … Continue reading New vtreat Documentation (Starting with Multinomial Classification)
Category: Practical Data Science
Why Do We Plot Predictions on the x-axis?
When studying regression models, One of the first diagnostic plots most students learn is to plot residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example. # build an “ideal” linear process. set.seed(34524) N = 100 x1 = runif(N) x2 = runif(N) noise = 0.25*rnorm(N) y = x1 … Continue reading Why Do We Plot Predictions on the x-axis?
How to Prepare Data
Real world data can present a number of challenges to data science workflows. Even properly structured data (each interesting measurement already landed in distinct columns), can present problems, such as missing values and high cardinality categorical variables. In this note we describe some great tools for working with such data. For an example: consider the … Continue reading How to Prepare Data
Preparing Data for Supervised Classification
Nina Zumel has been polishing up new vtreat for Python documentation and tutorials. They are coming out so good that I find to be fair to the R community I must start to back-port this new documentation to vtreat for R. vtreat is a package for systematically preparing data for supervised machine learning tasks such … Continue reading Preparing Data for Supervised Classification
The Advantages of Record Transform Specifications
Nina Zumel had a really great article on how to prepare a nice Keras performance plot using R.
I will use this example to show some of the advantages of cdata record transform specifications.
The model performance data from Keras is in the following…
Practical Data Science with R update
Just got the following note from a new reader: Thank you for writing Practical Data Science with R. It’s challenging for me, but I am learning a lot by following your steps and entering the commands. Wow, this is exactly what Nina Zumel and I hoped for. We wish we could make everything easy, but … Continue reading Practical Data Science with R update
WVPlots 1.1.2 on CRAN
I have put a new release of the WVPlots package up on CRAN. This release adds palette and/or color controls to most of the plotting functions in the package. WVPlots was originally a catch-all package of ggplot2 visualizations that we at Win-Vector tended to use repeatedly, and wanted to turn into “one-liners.” A consequence of … Continue reading WVPlots 1.1.2 on CRAN
A Kind Note That We Really Appreciate
The following really made my day. I tell every data scientist I know about vtreat and urge them to read the paper. Jason Wolosonovich Jason, thanks for your support and thank you so much for taking the time to say this (and for your permission to quote you on this). For those interested the R … Continue reading A Kind Note That We Really Appreciate
Big News: Porting vtreat to Python
We at Win-Vector LLC have some big news. We are finally porting a streamlined version of our R vtreat variable preparation package to Python. vtreat is a great system for preparing messy data for suprevised machine learning. The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their … Continue reading Big News: Porting vtreat to Python
Big News: Porting vtreat to Python
We at Win-Vector LLC have some big news. We are finally porting a streamlined version of our R vtreat variable preparation package to Python. vtreat is a great system for preparing messy data for suprevised machine learning. The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their … Continue reading Big News: Porting vtreat to Python
An Ad-hoc Method for Calibrating Uncalibrated Models
In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make … Continue reading An Ad-hoc Method for Calibrating Uncalibrated Models
Some Details on Running xgboost
While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation). In doing that I ran into one more avoidable but strange issue in using xgboost: when … Continue reading Some Details on Running xgboost
Common Ensemble Models can be Biased
In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved. However, when making predictions on individuals, a biased model may be preferable; biased models may … Continue reading Common Ensemble Models can be Biased
Link Functions versus Data Transforms
In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason for not regressing directly against income … Continue reading Link Functions versus Data Transforms
Link Functions versus Data Transforms
In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason for not regressing directly against income … Continue reading Link Functions versus Data Transforms
Cohen’s D for Experimental Planning
In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments. Estimating sample size Let’s imagine you are testing a new weight loss program and comparing it so some existing weight loss regimen. You want to run an experiment to determine if the new program is more effective than the old one. … Continue reading Cohen’s D for Experimental Planning
Free Video Lecture: Vectors for Programmers and Data Scientists
We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas do programmers find interesting about vectors, what concepts do they consider safe starting points, and how to condense and present the material. Please check the lectures out. Vectors for Programmers and Data … Continue reading Free Video Lecture: Vectors for Programmers and Data Scientists
Could not Resist
Also, Practical Data Science with R, 2nd Edition; Zumel, Mount; Manning 2019 is now content complete! It is deep into editing and soon into production!
Data Layout Exercises
John Mount, Nina Zumel; Win-Vector LLC 2019-04-27 In this note we will use five real life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr/demo/, and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, … Continue reading Data Layout Exercises
Practical Data Science with R Book Update (April 2019)
I thought I would give a personal update on our book: Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019. The second edition should be fully available this fall! Nina and I have finished up through chapter 10 (of 12), and Manning has released previews of up through chapter 7 (with more to … Continue reading Practical Data Science with R Book Update (April 2019)