Blog Archives

Managing Spark data handles in R

May 26, 2017

When working with big data in R (say, using Spark and sparklyr), we have found it very convenient to keep data handles in a neat list or data_frame. Please read on for our handy hints on keeping your data handles neat. When using R to work over a big data system (such as Spark) much …

New series: R and big data (concentrating on Spark and sparklyr)

May 20, 2017

Win-Vector LLC has recently been teaching how to use R with big data through Spark and sparklyr. We have also been helping clients become productive on R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big data using R, share some …

dplyr in Context

May 7, 2017

Introduction: Beginning R users often come away with the false impression that the popular packages dplyr and tidyr are both all of R and sui generis inventions (in that they might be unprecedented and there might be no other reasonable way to get the same effects in R). These packages and their conventions are high-value, but they …

Why to use wrapr::let()

May 2, 2017

I have written about referential transparency before. In this article I would like to discuss “leaky abstractions” and why wrapr::let() supplies a useful (but leaky) abstraction for R programmers. Abstractions: A common definition of an abstraction is (from the OSX dictionary): “the process of considering something independently of its associations, attributes, or concrete accompaniments.” In …

Encoding categorical variables: one-hot and beyond

April 15, 2017

(or: how to correctly use xgboost from R) R has "one-hot" encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere. For example, we can see evidence of one-hot encoding …

You can’t do that in statistics

April 6, 2017

There are a number of statistical principles that are perhaps more honored in the breach than in the observance. For fun I am going to name a few, and show why they are not always the “precision surgical knives of thought” one would hope for (working more like large hammers). The litany of complaints: A …

Coordinatized Data: A Fluid Data Specification

March 29, 2017

Authors: John Mount and Nina Zumel. Introduction: It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion between row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting). [Boris Artzybasheff illustration] Real trust and understanding of this concept doesn’t …

Debugging Pipelines in R with Bizarro Pipe and Eager Assignment

March 25, 2017

This is a note on debugging magrittr pipelines in R using Bizarro Pipe and eager assignment. Pipes in R: The magrittr R package supplies an operator called “pipe”, which is written as “%>%”. The pipe operator is famous in part due to its extensive use in dplyr and by dplyr users. The pipe operator is …

Datashader is a big deal

March 22, 2017

I recently got back from Strata West 2017 (where I ran a very well received workshop on R and Spark). One thing that really stood out for me at the exhibition hall was Bokeh plus datashader from Continuum Analytics. I had the privilege of having Peter Wang himself demonstrate datashader for me and answer a …

Practical Data Science with R: ACM SIGACT News Book Review and Discount!

March 19, 2017

Our book Practical Data Science with R has just been reviewed in the Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory (ACM SIGACT) News by Dr. Allan M. Miller (U.C. Berkeley)! The book is half off at Manning on March 21st, 2017 using the following code (please share/Tweet): Deal of the Day March …
