Blog Archives

Teaching pivot / un-pivot

April 11, 2017
By
Teaching pivot / un-pivot

Authors: John Mount and Nina Zumel Introduction In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot. One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” or “gathering“) is easy to … Continue reading Teaching pivot / un-pivot

Read more »

A Simple Example of Using replyr::gapply

December 19, 2016
By
A Simple Example of Using replyr::gapply

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work … Continue reading A Simple Example of Using replyr::gapply

Read more »

A Simple Example of Using replyr::gapply

December 19, 2016
By
A Simple Example of Using replyr::gapply

It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work … Continue reading A Simple Example of Using replyr::gapply

Read more »

Using replyr::let to Parameterize dplyr Expressions

December 7, 2016
By
Using replyr::let to Parameterize dplyr Expressions

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want … Continue reading Using replyr::let to Parameterize dplyr Expressions

Read more »

Upcoming Talks

October 17, 2016
By

I (Nina Zumel) will be speaking at the Women who Code Silicon Valley meetup on Thursday, October 27. The talk is called Improving Prediction using Nested Models and Simulated Out-of-Sample Data. In this talk I will discuss nested predictive models. These are models that predict an outcome or dependent variable (called y) using additional submodels … Continue reading Upcoming Talks

Read more »

Principal Components Regression, Pt. 3: Picking the Number of Components

May 30, 2016
By
Principal Components Regression, Pt. 3: Picking the Number of Components

In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in … Continue reading Principal Components Regression, Pt. 3: Picking the Number of Components

Read more »

Principal Components Regression, Pt. 2: Y-Aware Methods

May 23, 2016
By
Principal Components Regression, Pt. 2: Y-Aware Methods

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components … Continue reading Principal Components Regression, Pt. 2: Y-Aware Methods

Read more »

Principal Components Regression, Pt.1: The Standard Method

May 17, 2016
By
Principal Components Regression, Pt.1: The Standard Method

In this note, we discuss principal components regression and some of the issues with it: The need for scaling. The need for pruning. The lack of “y-awareness” of the standard dimensionality reduction step. The purpose of this article is to set the stage for presenting dimensionality reduction techniques appropriate for predictive modeling, such as y-aware … Continue reading Principal Components Regression, Pt.1: The Standard Method

Read more »

Finding the K in K-means by Parametric Bootstrap

February 9, 2016
By
Finding the K in K-means by Parametric Bootstrap

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample … Continue reading Finding the K in K-means by Parametric Bootstrap

Read more »

Upcoming Win-Vector Appearances

November 9, 2015
By

We have two public appearances coming up in the next few weeks: Workshop at ODSC, San Francisco – November 14 Both of us will be giving a two-hour workshop called Preparing Data for Analysis using R: Basic through Advanced Techniques. We will cover key issues in this important but often neglected aspect of data science, … Continue reading Upcoming Win-Vector Appearances

Read more »


Subscribe

Email:

  Subscribe