Blog Archives

Why do Decision Trees Work?

January 6, 2017

In this article we will discuss the machine learning method called “decision trees”, moving quickly over the usual “how decision trees work” and spending time on “why decision trees work.” We will write from a computational learning theory perspective, and hope this helps make both decision trees and computational learning theory more comprehensible. The goal …

A Theory of Nested Cross Simulation

January 2, 2017

[Reader’s Note. Some of our articles are applied and some of our articles are more theoretical. The following article is more theoretical, and requires fairly formal notation to even work through. However, it should be of interest as it touches on some of the fine points of cross-validation that are quite hard to perceive or …

Data Preparation, Long Form and tl;dr Form

December 26, 2016

Data preparation and cleaning are some of the most important steps of predictive analytics and data science tasks. They are laborious, they are where most of the errors are made, they are your last line of defense against wild data, and they hold the biggest opportunities for outcome improvement. No matter how much time you spend on them, they …

Does replyr::let work with data.table?

December 24, 2016

I’ve been asked if the adapter “let” from our R package replyr works with data.table. My answer is: it does work. I am not a data.table user, so I am not the one to ask whether data.table benefits from a non-standard evaluation to standard evaluation adapter such as replyr::let. Using replyr::let with data.table looks …

Comparative examples using replyr::let

December 22, 2016

Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier. Archie’s Mechanics #2 (1954) copyright Archie Publications (edit: great news! …

help(let, package=’replyr’)

December 17, 2016

A bit more on the `let` wrapper from our `replyr` R package: `library("replyr")`, then `help(let, package="replyr")`. (Edit: this has been updated to the `0.2.0` version of `replyr` which eliminates some of the `()` notation.) The help page, “let {replyr}” in the R Documentation, reads: “Execute expr with name substitutions specified in alias.” Description: let implements a mapping from desired names (names used …

Organize your data manipulation in terms of “grouped ordered apply”

December 15, 2016

Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves grouping, ordering, and window functions all at the same time. It also is …

The Case For Using -> In R

December 13, 2016

R has a number of assignment operators (at least “<-”, “=”, and “->”; plus “<<-” and “->>”, which have different semantics). R style guides routinely insist on “<-” as the only preferred form. In this note we are going to try to make the case for “->” when using magrittr pipelines. [edit: After reading …

The case for index-free data manipulation

December 10, 2016

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit …

Be careful evaluating model predictions

December 3, 2016

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score. This is because correlation tells you whether some re-scaling of your result is useful, but you want to know whether the result in your hand is in fact useful. For example: the Mars Climate Orbiter …

