We are sharing a chalk talk rehearsal on applied probability. We use basic notions of probability theory to work through the estimation of sample size needed to reliably estimate event rates. This expands basic calculations, and then moves to the idea…
Nina and I have been sending out drafts of our book Practical Data Science with R 2nd Edition for technical review. A few of the reviews came back from reviewers that described themselves with variations of: Senior Business Analyst for COMPANYNAME. I have been involved in presenting graphs of data for many years. To us … Continue reading Technical books are amazing opportunities
Our publisher, Manning, is running a Memorial Day sale this weekend (May 24-27, 2019), with a new offer every day. Fri: Half off all eBooks Sat: Half off all MEAPs Sun: Half off all pBooks and liveVideos Mon: Half off everything The discount code is: wm052419au. Many great opportunities to get Practical Data Science with … Continue reading Practical Data Science with R, half off sale!
We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with what ideas do programmers find interesting about vectors, what concepts do they consider safe starting points, and how to condense and present the material. Please check the lectures out. Vectors for Programmers and Data … Continue reading Free Video Lecture: Vectors for Programmers and Data Scientists
In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames. We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so … Continue reading Timing Working With a Row or a Column from a data.frame
I would like to write a bit on the meaning and history of the phrase “tidy data.” Hadley Wickham has been promoting the term “tidy data.” For example in an eponymous paper, he wrote: In tidy data: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. … Continue reading What is “Tidy Data”
Also, Practical Data Science with R, 2nd Edition; Zumel, Mount; Manning 2019 is now content complete! It is deep into editing and soon into production!
John Mount, Nina Zumel; Win-Vector LLC 2019-04-27 In this note we will use five real life examples to demonstrate data layout transforms using the cdata R package. The examples for this note are all demo-examples from tidyr/demo/, and are mostly based on questions posted to StackOverflow. They represent a good cross-section of data layout problems, … Continue reading Data Layout Exercises
I thought I would give a personal update on our book: Practical Data Science with R 2nd edition; Zumel, Mount; Manning 2019. The second edition should be fully available this fall! Nina and I have finished up through chapter 10 (of 12), and Manning has released previews of up through chapter 7 (with more to … Continue reading Practical Data Science with R Book Update (April 2019)
Here is an example how easy it is to use cdata to re-layout your data. Tim Morris recently tweeted the following problem (corrected). Please will you take pity on me #rstats folks? I only want to reshape two variables x & y from wide to long! Starting with: d xa xb ya yb 1 1 … Continue reading Controlling Data Layout With cdata
What R users now call piping, popularized by Stefan Milton Bache and Hadley Wickham, is inline function application (this is notationally similar to, but distinct from the powerful interprocess communication and concurrency tool introduced to Unix by Douglas McIlroy in 1973). In object oriented languages this sort of notation for function application has been called … Continue reading Piping is Method Chaining
A good friend is now a professor at the University of Auckland and knew to photograph and send us this. Thanks!!!
A good friend shared with us a great picture of Practical Data Science with R, 1st Edition hanging out in Cambridge at the MIT Press Bookstore. This is as good an excuse as any to share a book update. Nina Zumel and I (John Mount) are busy revising chapters 10 and 11 of Practical Data … Continue reading Practical Data Science with R Book Update
From the recent developer.r-project.org “Staged Install” article: Incidentally, there were just two distinct (very long) lists of methods in the warnings across all installed packages in my run, but repeated for many packages. It turned out that they were lists of exported methods from dplyr and rlang packages. These two packages take very long to … Continue reading Not Always C++’s Fault
The (matter of opinion) claim: “When the use of C++ is very limited and easy to avoid, perhaps it is the best option to do that […]” (source discussed here) got me thinking: does our own RcppDynProg package actually use C++ in a significant way? Could/should I port it to C? Am I informed enough … Continue reading Why RcppDynProg is Written in C++
“R is its packages”, so to know R we should know its popular packages (CRAN). Or put it another way: as R is a typical “the reference implementation is the specification” programming environment there is no true “de jure” R, only a de facto R. To look at popular R packages I defined “popular” as … Continue reading What are the Popular R Packages?
The recent r-project article “Use of C++ in Packages” stated as its own summary of recommendation: don’t use C++ to interface with R. A careful reading of the article exposes at least two possible meanings of this: Don’t use C++ to directly call R or directly manipulate R structures. A technical point directly argued (for … Continue reading C++ is Often Used in R Packages
There is a lot of unnecessary worry over “Non Standard Evaluation” (NSE) in R versus “Standard Evaluation” (SE, or standard “variables names refer to values” evaluation). This very author is guilty of over-discussing the issue. But let’s give this yet another try. The entire difference between NSE and regular evaluation can be summed up in … Continue reading Standard Evaluation Versus Non-Standard Evaluation in R
Sunny day at the office, and new mocha cup!
As of cdata version 1.0.8 cdata implements an operator notation for data transform.
The idea is simple, yet powerful.
First let’s start with some data.
d <- wrapr::build_frame(
"id", "measure", "value" |