Here is a simple modeling problem in R. We want to fit a linear model where the names of the data columns carrying the outcome to predict (y), the explanatory variables (x1, x2), and per-example row weights (wt) are given to us as strings. Let’s start with our example data and parameters. The point is: we … Continue reading Programming Over lm() in R
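A minimal base-R sketch of the setup described in this excerpt, for readers who want something concrete before following the link. The data values and the names `outcome_name`, `variable_names`, and `weight_name` are hypothetical, and this is one common approach rather than necessarily the one the full post takes.

```r
# Hypothetical example data; the values are illustrative only.
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2, 1, 4, 3, 6, 5),
  y  = c(3.1, 2.9, 7.2, 6.8, 11.1, 10.9),
  wt = c(1, 2, 1, 2, 1, 2)
)

# Column names arrive as strings (assumed parameter names).
outcome_name   <- "y"
variable_names <- c("x1", "x2")
weight_name    <- "wt"

# reformulate() builds the formula y ~ x1 + x2 from the strings.
f <- reformulate(variable_names, response = outcome_name)

# Look the weight column up by name instead of writing it literally.
model <- lm(f, data = d, weights = d[[weight_name]])

print(f)
print(coef(model))
```

One known cost of this style is that the stored call in `model$call` records `f` and `d[[weight_name]]` rather than the actual column names, which makes the printed model less readable.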
My favorite R data.table feature is the “by” grouping notation when combined with the := notation.
Let’s take a look at this powerful notation.
First, let’s build an example data.frame.
d <- wrapr::build_frame(
There is interest in converting relational query languages (that work both over SQL databases and on local data) into data.table commands, to take advantage of data.table’s superior performance. Obviously if one wants to use data.table it is best to learn data.table. But if we want code that can run in multiple places a translation layer may … Continue reading data.table is Much Better Than You Have Been Told
Nina and I have been sending out drafts of our book Practical Data Science with R 2nd Edition for technical review. A few of the reviews came back from reviewers who described themselves with variations of: Senior Business Analyst for COMPANYNAME. I have been involved in presenting graphs of data for many years. To us … Continue reading Technical books are amazing opportunities
In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames. We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so … Continue reading Timing Working With a Row or a Column from a data.frame
I would like to write a bit on the meaning and history of the phrase “tidy data.” Hadley Wickham has been promoting the term “tidy data.” For example in an eponymous paper, he wrote: In tidy data: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. … Continue reading What is “Tidy Data”
From the recent developer.r-project.org “Staged Install” article: Incidentally, there were just two distinct (very long) lists of methods in the warnings across all installed packages in my run, but repeated for many packages. It turned out that they were lists of exported methods from dplyr and rlang packages. These two packages take very long to … Continue reading Not Always C++’s Fault
The (matter of opinion) claim: “When the use of C++ is very limited and easy to avoid, perhaps it is the best option to do that […]” (source discussed here) got me thinking: does our own RcppDynProg package actually use C++ in a significant way? Could/should I port it to C? Am I informed enough … Continue reading Why RcppDynProg is Written in C++
“R is its packages”, so to know R we should know its popular packages (CRAN). Or put it another way: as R is a typical “the reference implementation is the specification” programming environment there is no true “de jure” R, only a de facto R. To look at popular R packages I defined “popular” as … Continue reading What are the Popular R Packages?
The recent r-project article “Use of C++ in Packages” stated as its own summary recommendation: don’t use C++ to interface with R. A careful reading of the article exposes at least two possible meanings of this: Don’t use C++ to directly call R or directly manipulate R structures. A technical point directly argued (for … Continue reading C++ is Often Used in R Packages
There is a lot of unnecessary worry over “Non Standard Evaluation” (NSE) in R versus “Standard Evaluation” (SE, or standard “variable names refer to values” evaluation). This very author is guilty of over-discussing the issue. But let’s give this yet another try. The entire difference between NSE and regular evaluation can be summed up in … Continue reading Standard Evaluation Versus Non-Standard Evaluation in R
We recently saw this UX (user experience) question from the tidyr author as he adapts tidyr to cdata techniques. The terminology that he is not adopting from cdata is “unpivot_to_blocks()” and “pivot_to_rowrecs()”. One of the research ideas in the cdata package is that the important thing to call out is record structure. The important point … Continue reading Why we Did Not Name the cdata Transforms wide/tall/long/short
We recently commented on excess package dependencies as representing risk in the R package ecosystem. The question remains: how much risk? Is low dependency a mere talisman, or is there evidence it is a good practice (or at least correlates with other good practices)? Well, it turns out we can quantify it: each additional non-core … Continue reading Quantifying R Package Dependency Risk
Dirk Eddelbuettel just shared an important point on software and analyses: dependencies are hard-to-manage risks. If your software or research depends on many complex and changing packages, you have no way to establish your work is correct. This is because to establish the correctness of your work, you would need to also establish … Continue reading Software Dependencies and Risk
I am collecting here some notes on testing in R. There seems to be a general (false) impression among non R-core developers that to run tests, R package developers need a test management system such as RUnit or testthat. And a further false impression that testthat is the only R test management system. This is … Continue reading Unit Tests in R
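To illustrate the claim that a test management system is not strictly required, here is a small sketch using only base R. `R CMD check` runs the R scripts in a package’s `tests/` directory and treats an error as a failure, so a script of `stopifnot()` calls can serve as a test. The function `scale_to_unit()` is a hypothetical stand-in for package code, not something from the post.

```r
# Hypothetical function under test: rescale a numeric vector to [0, 1].
scale_to_unit <- function(x) {
  rng <- range(x)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Plain base-R checks: stopifnot() signals an error at the first
# condition that is not TRUE, which is enough to fail a test script.
res <- scale_to_unit(c(2, 4, 6))
stopifnot(
  isTRUE(all.equal(res, c(0, 0.5, 1))),
  min(res) == 0,
  max(res) == 1
)
```

Saved as a file under `tests/` (for example a hypothetical `tests/test_scale.R`), a script like this needs no additional test package to run during a package check.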
Let’s try some “ugly corner cases” for data manipulation in R. Corner cases are examples where the user might be running to the edge of where the package developer intended their package to work, and thus often where things can go wrong. Let’s see what happens when we try to stick a fork in the … Continue reading Data Manipulation Corner Cases
Recently ran into something interesting on the R macros/quasi-quotation/substitution/syntax front:
It appears !! is no longer the last word in substitution (it certainly wasn’t the first).
The described effect is actually already pretty easy t…
Recently Hadley Wickham prescribed pronouncing the magrittr pipe as “then” and using right-assignment as follows: I am not sure if it is a good or bad idea. But let’s play with it a bit, and perhaps readers can submit their experience and opinions in the comments section. Right assignment Right assignment is a bit of … Continue reading Playing With Pipe Notations
Manning has a new discount code and a free excerpt of our book Practical Data Science with R, 2nd Edition: here.
This section is elementary, but things really pick up speed later on (also available in a paid preview).
In our cdata R package and training materials we emphasize record-oriented thinking and how to design a transform control table. We now have an exciting new feature: control table keys. The user can now control which columns of a cdata control table are the keys, including now using composite keys (that is, keys … Continue reading cdata Control Table Keys