Category: data science

Data Layout Exercises

John Mount, Nina Zumel; Win-Vector LLC, 2019-04-27. In this note we will use five real-life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr/demo/, and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, …

Starting With Data Science: A Rigorous Hands-On Introduction to Data Science for Engineers

Win-Vector LLC is now offering a 4-day on-site intensive data science course. The course targets engineers familiar with Python and introduces them to the basics of current data science practice. This is designed as an interactive in-person (not remote or …

rquery Substitution

The rquery R package has several places where the user can ask for what they have typed in to be replaced by a name or value stored in a variable. This becomes important as many of the rquery commands capture column names from unexecuted code. So knowing if something is treated as a symbol/name (which …

Query Generation in R

R users have been enjoying the benefits of SQL query generators for quite some time, most notably using the dbplyr package. I would like to talk about some features of our own rquery query generator, concentrating on derived result re-use. Introduction: SQL represents value use by nesting. To use a query result within another query …

PDSwR2: New Chapters!

We have two new chapters of Practical Data Science with R, Second Edition online and available for review! The newly available chapters cover: Data Engineering And Data Shaping – Explores how to use R to organize or wrangle data into a shape useful for analysis. The chapter covers applying data transforms, data manipulation packages, and …

Function Objects and Pipelines in R

Composing functions and sequencing operations are core programming concepts. Some notable realizations of sequencing or pipelining operations include: Unix’s |-pipe; CMS Pipelines; F#’s forward pipe operator |>; Haskell’s Data.Function & operator; the R magrittr forward pipe; and scikit-learn’s sklearn.pipeline.Pipeline. The idea is: many important calculations can be considered as a sequence of transforms applied to a …

vtreat Variable Importance

vtreat’s purpose is to produce purely numeric R data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing values re-encoded with indicators, and high-degree categorical variables re-encoded by effects codes or impact codes). …