Category: Pragmatic Machine Learning

WVPlots 1.1.2 on CRAN

I have put a new release of the WVPlots package up on CRAN. This release adds palette and/or color controls to most of the plotting functions in the package. WVPlots was originally a catch-all package of ggplot2 visualizations that we at Win-Vector tended to use repeatedly, and wanted to turn into “one-liners.” A consequence of … Continue reading WVPlots 1.1.2 on CRAN

New Getting Started with vtreat Documentation

Win Vector LLC’s Dr. Nina Zumel has just released some new vtreat documentation. vtreat is an all-in-one step data preparation system that helps defend your machine learning algorithms from: missing values, large cardinality categorical variables, and novel levels from categorical variables. I hoped she could get the Python vtreat documentation up to parity with … Continue reading New Getting Started with vtreat Documentation

Introducing data_algebra

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases. In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable). Introduction Parts of the … Continue reading Introducing data_algebra

What is vtreat?

vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner. vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other input columns are possible … Continue reading What is vtreat?

Lord Kelvin, Data Scientist

In 1876 A. Légé & Co., 20 Cross Street, Hatton Gardens, London completed the first “tide calculating machine” for William Thomson (later Lord Kelvin). (Figure: Thomson’s (Lord Kelvin) First Tide Predicting Machine, 1876.) The results were plotted on the paper cylinders, and one literally “turned the crank” to perform the calculations. The tide calculating machine … Continue reading Lord Kelvin, Data Scientist

Free Video Lecture: Vectors for Programmers and Data Scientists

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas about vectors programmers find interesting, which concepts they consider safe starting points, and how to condense and present the material. Please check the lectures out. Vectors for Programmers and Data … Continue reading Free Video Lecture: Vectors for Programmers and Data Scientists

Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Engineers

Win Vector LLC is now offering a 4-day on-site intensive data science course. The course targets engineers familiar with Python and introduces them to the basics of current data science practice. This is designed as an interactive in-person (not remote or … Continue reading Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Engineers

PDSwR2: New Chapters!

We have two new chapters of Practical Data Science with R, Second Edition online and available for review! The newly available chapters cover: Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and … Continue reading PDSwR2: New Chapters!

vtreat Variable Importance

vtreat’s purpose is to produce pure numeric R data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing values re-encoded with indicators, and high-degree categorical variables re-encoded as effects codes or impact codes). … Continue reading vtreat Variable Importance
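
To make the re-encoding idea in the excerpt above concrete, here is a minimal sketch in plain Python. It is not vtreat's implementation or API: the function `prepare`, the column suffixes `_is_missing` and `_impact`, and the toy data are all hypothetical, chosen only for illustration. It assumes one numeric column with missing values (filled with the observed mean and flagged by an indicator column) and one categorical column (replaced by a simple impact code, here the per-level mean outcome minus the grand mean outcome).

```python
from collections import defaultdict
from statistics import mean


def prepare(rows, outcome, num_col, cat_col):
    """Toy re-encoding (hypothetical helper, not part of vtreat)."""
    grand_mean = mean(r[outcome] for r in rows)

    # Mean of the observed (non-missing) numeric values, used to fill gaps.
    fill = mean(r[num_col] for r in rows if r[num_col] is not None)

    # Simple impact code: per-level mean outcome minus the grand mean.
    by_level = defaultdict(list)
    for r in rows:
        by_level[r[cat_col]].append(r[outcome])
    impact = {lvl: mean(ys) - grand_mean for lvl, ys in by_level.items()}

    prepared = []
    for r in rows:
        missing = r[num_col] is None
        prepared.append({
            num_col: fill if missing else r[num_col],
            num_col + "_is_missing": 1.0 if missing else 0.0,  # indicator column
            cat_col + "_impact": impact[r[cat_col]],
        })
    return prepared


# Made-up toy data, for illustration only.
rows = [
    {"x": 1.0, "c": "a", "y": 1.0},
    {"x": None, "c": "a", "y": 3.0},
    {"x": 3.0, "c": "b", "y": 5.0},
    {"x": 5.0, "c": "b", "y": 7.0},
]
prepared = prepare(rows, outcome="y", num_col="x", cat_col="c")
```

The result is purely numeric with no missing values, which is the "ready" condition the excerpt describes. One caveat of this sketch: it computes the impact codes from the same rows it then encodes, which can leak outcome information into the prepared data, so a careful treatment would estimate them on separate or cross-validated data.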