Posts Tagged ‘ sparklyr ’

Working With R and Big Data: Use Replyr

July 6, 2017
By

In our latest R and Big Data article we discuss replyr. Why replyr replyr stands for REmote PLYing of big data for R. Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or Spark). replyr allows users to … Continue reading Working With R and Big Data: Use Replyr

Read more »

Managing intermediate results when using R/sparklyr

June 9, 2017
By
Managing intermediate results when using R/sparklyr

In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr. Handle management Many Sparklyr tasks involve creation of intermediate or temporary tables. This can be through dplyr::copy_to() and through dplyr::compute(). These handles can represent a reference leak and eat … Continue reading Managing intermediate results when using R/sparklyr

Read more »

There is usually more than one way in R

June 5, 2017
By

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”): There should be one– and preferably only one –obvious way to do it. Frankly in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions: … Continue reading There is usually more than one way in R

Read more »

Summarizing big data in R

May 30, 2017
By

Our next "R and big data tip" is: summarizing big data. We always say "if you are not looking at the data, you are not doing science"- and for big data you are very dependent on summaries (as you can’t actually look at everything). Simple question: is there an easy way to summarize big data … Continue reading Summarizing big data in R

Read more »

Managing Spark data handles in R

May 26, 2017
By
Managing Spark data handles in R

When working with big data with R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data_frame. Please read on for our handy hints on keeping your data handles neat. When using R to work over a big data system (such as Spark) much … Continue reading Managing Spark data handles in R

Read more »

Going to Strata / Hadoop World 2017 San Jose?

February 4, 2017
By

Are you attending or considering attending Strata / Hadoop World 2017 San Jose? Are you interested in learning to use R to work with Spark and h2o? Then please consider signing up for my 3 1/2 hour workshop soon. We are about half full now, but I really want to fill the room, while making … Continue reading Going to Strata / Hadoop World 2017 San Jose?

Read more »

Upcoming Win-Vector LLC public speaking engagements

January 28, 2017
By

I am happy to announce a couple of exciting upcoming Win-Vector LLC public speaking engagements. BARUG Meetup Tuesday, Tuesday February 7, 2017 ~7:50pm, Intuit, Building 20, 2600 Marine Way, Mountain View, CA. Win-Vector LLC’s John Mount will be giving a “lightning talk” (15 minutes) on R calling conventions (standard versus non-standard) and showing how to … Continue reading Upcoming Win-Vector LLC public speaking engagements

Read more »

Organize your data manipulation in terms of “grouped ordered apply”

December 15, 2016
By
Organize your data manipulation in terms of “grouped ordered apply”

Consider the common following problem: compute for a data set (say the infamous iris example data set) per-group ranks. Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves all at the same time grouping, ordering, and window functions. It also is … Continue reading Organize your data manipulation in terms of “grouped ordered apply”

Read more »

The case for index-free data manipulation

December 10, 2016
By
The case for index-free data manipulation

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit … Continue reading The case for index-free data manipulation

Read more »


Subscribe

Email:

  Subscribe