In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames. We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so … Continue reading Timing Working With a Row or a Column from a data.frame
I would like to write a bit on the meaning and history of the phrase “tidy data.” Hadley Wickham has been promoting the term “tidy data.” For example in an eponymous paper, he wrote: In tidy data: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. … Continue reading What is “Tidy Data”
From the recent developer.r-project.org “Staged Install” article: Incidentally, there were just two distinct (very long) lists of methods in the warnings across all installed packages in my run, but repeated for many packages. It turned out that they were lists of exported methods from dplyr and rlang packages. These two packages take very long to … Continue reading Not Always C++’s Fault
The (matter of opinion) claim: “When the use of C++ is very limited and easy to avoid, perhaps it is the best option to do that […]” (source discussed here) got me thinking: does our own RcppDynProg package actually use C++ in a significant way? Could/should I port it to C? Am I informed enough … Continue reading Why RcppDynProg is Written in C++
“R is its packages”, so to know R we should know its popular packages (CRAN). Or put it another way: as R is a typical “the reference implementation is the specification” programming environment there is no true “de jure” R, only a de facto R. To look at popular R packages I defined “popular” as … Continue reading What are the Popular R Packages?
The recent r-project article “Use of C++ in Packages” stated as its own summary of recommendation: don’t use C++ to interface with R. A careful reading of the article exposes at least two possible meanings of this: Don’t use C++ to directly call R or directly manipulate R structures. A technical point directly argued (for … Continue reading C++ is Often Used in R Packages
There is a lot of unnecessary worry over “Non Standard Evaluation” (NSE) in R versus “Standard Evaluation” (SE, or standard “variables names refer to values” evaluation). This very author is guilty of over-discussing the issue. But let’s give this yet another try. The entire difference between NSE and regular evaluation can be summed up in … Continue reading Standard Evaluation Versus Non-Standard Evaluation in R
We recently saw this UX (user experience) question from the tidyr author as he adapts tidyr to cdata techniques. The terminology that he is not adopting from cdata is “unpivot_to_blocks()” and “pivot_to_rowrecs()”. One of the research ideas in the cdata package is that the important thing to call out is record structure. The important point … Continue reading Why we Did Not Name the cdata Transforms wide/tall/long/short
We recently commented on excess package dependencies as representing risk in the R package ecosystem. The question remains: how much risk? Is low dependency a mere talisman, or is there evidence it is a good practices (or at least correlates with other good pracices)? Well, it turns out we can quantify it: each additional non-core … Continue reading Quantifying R Package Dependency Risk
Dirk Eddelbuettel just shared an important point on software and analyses: dependencies are hard to manage risks. If your software or research depends on many complex and changing packages, you have no way to establish your work is correct. This is because to establish the correctness of your work, you would need to also establish … Continue reading Software Dependencies and Risk
I am collecting here some notes on testing in R. There seems to be a general (false) impression among non R-core developers that to run tests, R package developers need a test management system such as RUnit or testthat. And a further false impression that testthat is the only R test management system. This is … Continue reading Unit Tests in R
Let’s try some “ugly corner cases” for data manipulation in R. Corner cases are examples where the user might be running to the edge of where the package developer intended their package to work, and thus often where things can go wrong. Let’s see what happens when we try to stick a fork in the … Continue reading Data Manipulation Corner Cases
Recently ran into something interesting in the R macros/quasi-quotation/substitution/syntax front:
It appears !! is no longer the last word in substitution (it certainly wasn’t the first).
The described effect is actually already pretty easy t…
Recently Hadley Wickham prescribed pronouncing the magrittr pipe as “then” and using right-assignment as follows: I am not sure if it is a good or bad idea. But let’s play with it a bit, and perhaps readers can submit their experience and opinions in the comments section. Right assignment Right assignment is a bit of … Continue reading Playing With Pipe Notations
Manning has a new discount code and a free excerpt of our book Practical Data Science with R, 2nd Edition: here.
This section is elementary, but things really pick up speed as later on (also available in a paid preview).
In our cdata R package and training materials we emphasize the record-oriented thinking and how to design a transform control table. We now have an additional exciting new feature: control table keys. The user can now control which columns of a cdata control table are the keys, including now using composite keys (that is keys … Continue reading cdata Control Table Keys
To make teaching R quasi-quotation easier it would be nice if R string-interpolation and quasi-quotation both used the same notation. They are related concepts. So some commonality of notation would actually be clarifying, and help teach the concepts. We will define both of the above terms, and demonstrate the relation between the two concepts. String-interpolation … Continue reading Make Teaching R Quasi-Quotation Easier
While working on a variation of the RcppDynProg algorithm we derived the following beautiful identity of 2 by 2 real matrices: The superscript “top” denoting the transpose operation, the ||.||^2_2 denoting sum of squares norm, and the single |.| denoting determinant. This is derived from one of the check equations for the Moore–Penrose inverse and … Continue reading A Beautiful 2 by 2 Matrix Identity
While developing the RcppDynProg R package I took a little extra time to port the core algorithm from C++ to both R and Python. This means I can time the exact same algorithm implemented nearly identically in each of these three languages. So I can extract some comparative “apples to apples” timings. Please read on … Continue reading Timing the Same Algorithm in R, Python, and C++
In our last note we used wrapr::qe() to help quote expressions. In this note we will discuss quoting and code-capturing interfaces (interfaces that capture user source code) a bit more. My position on code-capturing interfaces (or non-standard-evaluation/NSE) is: if poorly handled, they can be a large interface price/risk to pay for the minor convenience of … Continue reading Quoting Concatenate