Posts Tagged ‘Spark’

Working With R and Big Data: Use Replyr

July 6, 2017

In our latest R and Big Data article we discuss replyr. Why replyr? replyr stands for REmote PLYing of big data for R. Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark). replyr allows users to … Continue reading Working With R and Big Data: Use Replyr

Read more »

Managing intermediate results when using R/sparklyr

June 9, 2017

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr. Handle management: Many sparklyr tasks involve the creation of intermediate or temporary tables. This can be through dplyr::copy_to() and through dplyr::compute(). These handles can represent a reference leak and eat … Continue reading Managing intermediate results when using R/sparklyr

Read more »

There is usually more than one way in R

June 5, 2017

Python has a fairly famous design principle (from “PEP 20 -- The Zen of Python”): “There should be one-- and preferably only one --obvious way to do it.” Frankly, in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions: … Continue reading There is usually more than one way in R

Read more »

Summarizing big data in R

May 30, 2017

Our next "R and big data tip" is: summarizing big data. We always say "if you are not looking at the data, you are not doing science", and for big data you are very dependent on summaries (as you can’t actually look at everything). Simple question: is there an easy way to summarize big data … Continue reading Summarizing big data in R

Read more »

Managing Spark data handles in R

May 26, 2017

When working with big data with R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data_frame. Please read on for our handy hints on keeping your data handles neat. When using R to work over a big data system (such as Spark) much … Continue reading Managing Spark data handles in R

Read more »

New screencast: using R and RStudio to install and experiment with Apache Spark

March 15, 2017

I have a new short screencast up: using R and RStudio to install and experiment with Apache Spark. More material from my recent Strata workshop Modeling big data with R, sparklyr, and Apache Spark can be found here.

Read more »

replyr: Get a Grip on Big Data in R

March 5, 2017

replyr is an R package that contains extensions, adaptions, and work-arounds to make remote R dplyr data sources (including big data systems such as Spark) behave more like local data. This allows the analyst to more easily develop and debug procedures that simultaneously work on a variety of data services (in-memory data.frame, SQLite, PostgreSQL, and … Continue reading replyr: Get a Grip on Big Data in R

Read more »

A quick introduction to Apache Spark for statisticians

February 8, 2017

Introduction: Apache Spark is a Scala library for analysing "big data". It can be used for analysing huge (internet-scale) datasets distributed across large clusters of machines. The analysis can be anything from the computation of simple descriptive statistics associated with the datasets, through to rather sophisticated machine learning pipelines involving data pre-processing, transformation, nonlinear model … Continue reading A quick introduction to Apache Spark for statisticians

Read more »

Going to Strata / Hadoop World 2017 San Jose?

February 4, 2017

Are you attending or considering attending Strata / Hadoop World 2017 San Jose? Are you interested in learning to use R to work with Spark and h2o? Then please consider signing up for my 3 1/2 hour workshop soon. We are about half full now, but I really want to fill the room, while making … Continue reading Going to Strata / Hadoop World 2017 San Jose?

Read more »

Upcoming Win-Vector LLC public speaking engagements

January 28, 2017

I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements. BARUG Meetup: Tuesday, February 7, 2017, ~7:50pm, Intuit, Building 20, 2600 Marine Way, Mountain View, CA. Win-Vector LLC’s John Mount will be giving a “lightning talk” (15 minutes) on R calling conventions (standard versus non-standard) and showing how to … Continue reading Upcoming Win-Vector LLC public speaking engagements

Read more »

