vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner. vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other input columns are possible … Continue reading What is vtreat?
Tag: R
Speaking at BARUG
We will be speaking at the Tuesday, September 3, 2019 BARUG. If you are in the Bay Area, please come see us. Nina Zumel & John Mount Practical Data Science with R Practical Data Science with R (Zumel and Mount) was one of the first, and most widely-read books on the practice of doing Data … Continue reading Speaking at BARUG
vtreat up on PyPi
I am excited to announce vtreat is now available for Python on PyPi, in addition for R on CRAN. vtreat is: A data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. vtreat prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. … Continue reading vtreat up on PyPi
Returning to Tides
Fred Viole shared a great “data only” R solution to the forecasting tides problem. The methodology comes from a finance perspective, and has some great associated notes and articles. This gives me a chance to comment on the odd relation between prediction and profit in finance. If there really was a trade-able item with low … Continue reading Returning to Tides
Lord Kelvin, Data Scientist
In 1876 A. Légé & Co., 20 Cross Street, Hatton Gardens, London completed the first “tide calculating machine” for William Thomson (later Lord Kelvin) (ref). Thomson’s (Lord Kelvin) First Tide Predicting Machine, 1876 The results were plotted on the paper cylinders, and one literally “turned the crank” to perform the calculations. The tide calculating machine … Continue reading Lord Kelvin, Data Scientist
R wins COPSS Award!
Hadley Wickham from RStudio has won the 2019 COPSS Award, which expresses a rather radical switch from the traditional recipient of this award in that this recognises his many contributions to the R language and in particular to RStudio. The full quote for the nomination is his “influential work in statistical computing, visualisation, graphics, and […]
Some Notes on GNU Licenses in R Packages
I was recently asked if Win-Vector LLC would move the R wrapr package from a GPL-3 license to an LGPL license. In the end I decided to move wrapr distribution to a “GPL-2 | GPL-3” license. This means the package is now available under both GPL-2 and GPL-3 licensing, allowing the user to pick which … Continue reading Some Notes on GNU Licenses in R Packages
A Comment on Data Science Integrated Development Environments
A point that differs from our experience struck us in the recent note: A development environment specifically tailored to the data science sector on the level of RStudio, for example, does not (yet) exist. “What’s the Best Statistical Software? A Comparison of R, Python, SAS, SPSS and STATA” Amit Ghosh Actually, Python has a large … Continue reading A Comment on Data Science Integrated Development Environments
A Kind Note That We Really Appreciate
The following really made my day. I tell every data scientist I know about vtreat and urge them to read the paper. Jason Wolosonovich Jason, thanks for your support and thank you so much for taking the time to say this (and for your permission to quote you on this). For those interested the R … Continue reading A Kind Note That We Really Appreciate
R Books Discount!
We, the community of Manning R and data science authors, have talked Manning into offering a catalog-wide 40% discount on all books. Please take a look at some great deals on some great technical books here: http://mng.bz/adRj !
Gibbs sampling with incompatible conditionals
An interesting question (with no clear motivation) on X validated wondering why a Gibbs sampler produces NAs… Interesting because multi-layered: The attached R code indeed produces NAs because it calls the Negative Binomial Neg(x¹,p) random generator with a zero success parameter, x¹=0, which automatically returns NAs. This can be escaped by returning a one (1) […]
Big News: Porting vtreat to Python
We at Win-Vector LLC have some big news. We are finally porting a streamlined version of our R vtreat variable preparation package to Python. vtreat is a great system for preparing messy data for suprevised machine learning. The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their … Continue reading Big News: Porting vtreat to Python
Big News: Porting vtreat to Python
We at Win-Vector LLC have some big news. We are finally porting a streamlined version of our R vtreat variable preparation package to Python. vtreat is a great system for preparing messy data for suprevised machine learning. The new implementation is based on Pandas, and we are experimenting with pushing the sklearn.pipeline.Pipeline APIs to their … Continue reading Big News: Porting vtreat to Python
An Ad-hoc Method for Calibrating Uncalibrated Models
In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make … Continue reading An Ad-hoc Method for Calibrating Uncalibrated Models
Is Scholarly Use of R Use Beating SPSS Already?
by Bob Muenchen & Sean Mackinnon One of us (Muenchen) has been tracking The Popularity of Data Science Software using a variety of different approaches. One approach is to use Google Scholar to count the number of scholarly articles found … Continue reading →
Some Details on Running xgboost
While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation). In doing that I ran into one more avoidable but strange issue in using xgboost: when … Continue reading Some Details on Running xgboost
a non-riddle
Unless I missed a point in the last riddle from the Riddler, there is very little to say about it: Given N ocre balls, N aquamarine balls, and two urns, what is the optimal way to allocate the balls to the urns towards drawing an ocre ball with no urn being empty? Both my reasoning […]
Common Ensemble Models can be Biased
In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved. However, when making predictions on individuals, a biased model may be preferable; biased models may … Continue reading Common Ensemble Models can be Biased
CRAN does not validate R packages!
A friend called me the other day for advice on how to submit an R package to CRAN along with a proof his method was mathematically sound. I replied with some items of advice taken from my (limited) experience with submitting packages. And with the remark that CRAN would not validate the mathematical contents of […]
Le Monde puzzle [#1105]
Another token game as Le Monde mathematical puzzle: Archibald and Beatrix play with a pile of n>100 tokens, sequentially picking m tokens from the pile with m being a prime number [including m=1] or a multiple of 6, the winner taking the last tokens. If Beatrix knows n and proposes to Archibald to start, what […]