Blog Archives

Finding the K in K-means by Parametric Bootstrap

February 9, 2016

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 (available as a free sample … Continue reading Finding the K in K-means by Parametric Bootstrap

Upcoming Win-Vector Appearances

November 9, 2015

We have two public appearances coming up in the next few weeks. Workshop at ODSC, San Francisco – November 14: Both of us will be giving a two-hour workshop called “Preparing Data for Analysis using R: Basic through Advanced Techniques.” We will cover key issues in this important but often neglected aspect of data science, … Continue reading Upcoming Win-Vector Appearances

Our Differential Privacy Mini-series

November 2, 2015

We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so we’ve tried to touch on … Continue reading Our Differential Privacy Mini-series

A Simpler Explanation of Differential Privacy

October 2, 2015

Differential privacy was originally developed to facilitate secure analysis over sensitive data, with mixed success. It’s back in the news now, with exciting results from Cynthia Dwork et al. (see references at the end of the article) that apply results from differential privacy to machine learning. In this article we’ll work through the definition … Continue reading A Simpler Explanation of Differential Privacy

How do you know if your model is going to work?

September 22, 2015

Authors: John Mount (more articles) and Nina Zumel (more articles). Our four-part article series collected into one piece. Part 1: The problem; Part 2: In-training set measures; Part 3: Out of sample procedures; Part 4: Cross-validation techniques. “Essentially, all models are wrong, but some are useful.” (George Box) Here’s a caricature of a data … Continue reading How do you know if your model is going to work?

Bootstrap Evaluation of Clusters

September 4, 2015

Illustration from Project Gutenberg. The goal of cluster analysis is to group the observations in the data into clusters such that every data point in a cluster is more similar to the other data points in the same cluster than it is to data points in other clusters. This is an analysis method of choice when annotated training data … Continue reading Bootstrap Evaluation of Clusters

How Do You Know if Your Data Has Signal?

August 10, 2015

Image by Liz Sullivan, Creative Commons. Source: Wikimedia. An all-too-common approach to modeling in data science is to throw all possible variables at a modeling procedure and “let the algorithm sort it out.” This is tempting when you are not sure what are the true causes or predictors of the phenomenon you are … Continue reading How Do You Know if Your Data Has Signal?

Working with Sessionized Data 2: Variable Selection

July 15, 2015

In our previous post in this series, we introduced sessionization, or converting log data into a form that’s suitable for analysis. We looked at basic considerations, like dealing with time, choosing an appropriate dataset for training models, and choosing appropriate (and achievable) business goals. In that previous example, we sessionized the data by considering all … Continue reading Working with Sessionized Data 2: Variable Selection →

Working with Sessionized Data 1: Evaluating Hazard Models

July 8, 2015

When we teach data science we emphasize the data scientist’s responsibility to transform available data from multiple systems of record into a wide or denormalized form. In such a “ready to analyze” form each individual example gets a row of data and every fact about the example is a column. Usually transforming data into this … Continue reading Working with Sessionized Data 1: Evaluating Hazard Models →

Wanted: A Perfect Scatterplot (with Marginals)

June 12, 2015

We saw this scatterplot with marginal densities the other day, in a blog post by Thomas Wiecki. The graph was produced in Python, using the seaborn package. Seaborn calls it a “jointplot;” it’s called a “scatterhist” in Ma...
