R users have been enjoying the benefits of SQL query generators for quite some time, most notably using the dbplyr package. I would like to talk about some features of our own rquery query generator, concentrating on derived result re-use. Introduction SQL represents value use by nesting. To use a query result within another query … Continue reading Query Generation in R

# Category: Tutorials

## cdata Control Table Keys

In our cdata R package and training materials we emphasize the record-oriented thinking and how to design a transform control table. We now have an additional exciting new feature: control table keys. The user can now control which columns of a cdata control table are the keys, including now using composite keys (that is keys … Continue reading cdata Control Table Keys

## Function Objects and Pipelines in R

Composing functions and sequencing operations are core programming concepts. Some notable realizations of sequencing or pipelining operations include: Unix’s |-pipe CMS Pipelines. F#‘s forward pipe operator |>. Haskel’s Data.Function & operator. The R magrittr forward pipe. Scikit-learn‘s sklearn.pipeline.Pipeline. The idea is: many important calculations can be considered as a sequence of transforms applied to a … Continue reading Function Objects and Pipelines in R

## Fully General Record Transforms with cdata

One of the design goals of the cdata R package is that very powerful and arbitrary record transforms should be convenient and take only one or two steps. In fact it is the goal to take just about any record shape to any other in two steps: first convert to row-records, then re-block the data … Continue reading Fully General Record Transforms with cdata

## Make Teaching R Quasi-Quotation Easier

To make teaching R quasi-quotation easier it would be nice if R string-interpolation and quasi-quotation both used the same notation. They are related concepts. So some commonality of notation would actually be clarifying, and help teach the concepts. We will define both of the above terms, and demonstrate the relation between the two concepts. String-interpolation … Continue reading Make Teaching R Quasi-Quotation Easier

## R Tip: Use Inline Operators For Legibility

R Tip: use inline operators for legibility. A Python feature I miss when working in R is the convenience of Python‘s inline + operator. In Python, + does the right thing for some built in data types: It concatenates lists: [1,2] + [3] is [1, 2, 3]. It concatenates strings: ‘a’ + ‘b’ is ‘ab’. … Continue reading R Tip: Use Inline Operators For Legibility

## A Beautiful 2 by 2 Matrix Identity

While working on a variation of the RcppDynProg algorithm we derived the following beautiful identity of 2 by 2 real matrices: The superscript “top” denoting the transpose operation, the ||.||^2_2 denoting sum of squares norm, and the single |.| denoting determinant. This is derived from one of the check equations for the Moore–Penrose inverse and … Continue reading A Beautiful 2 by 2 Matrix Identity

## Timing the Same Algorithm in R, Python, and C++

While developing the RcppDynProg R package I took a little extra time to port the core algorithm from C++ to both R and Python. This means I can time the exact same algorithm implemented nearly identically in each of these three languages. So I can extract some comparative “apples to apples” timings. Please read on … Continue reading Timing the Same Algorithm in R, Python, and C++

## What does it mean to write “vectorized” code in R?

One often hears that R can not be fast (false), or more correctly that for fast code in R you may have to consider “vectorizing.” A lot of knowledgable R users are not comfortable with the term “vectorize”, and not really familiar with the method. “Vectorize” is just a slightly high-handed way of saying: R … Continue reading What does it mean to write “vectorized” code in R?

## Introducing RcppDynProg

RcppDynProg is a new Rcpp based R package that implements simple, but powerful, table-based dynamic programming. This package can be used to optimally solve the minimum cost partition into intervals problem (described below) and is useful in building piecewise estimates of functions (shown in this note). The abstract problem The primary problem RcppDynProg::solve_dynamic_program() is designed … Continue reading Introducing RcppDynProg

## vtreat Variable Importance

vtreat‘s purpose is to produce pure numeric R data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing-values re-encoded with indicators, and high-degree categorical re-encode by effects codes or impact codes). … Continue reading vtreat Variable Importance

## Quoting Concatenate

In our last note we used wrapr::qe() to help quote expressions. In this note we will discuss quoting and code-capturing interfaces (interfaces that capture user source code) a bit more. My position on code-capturing interfaces (or non-standard-evaluation/NSE) is: if poorly handled, they can be a large interface price/risk to pay for the minor convenience of … Continue reading Quoting Concatenate

## Reusable Pipelines in R

Pipelines in R are popular, the most popular one being magrittr as used by dplyr. This note will discuss the advanced re-usable piping systems: rquery/rqdatatable operator trees and wrapr function object pipelines. In each case we have a set of objects designed to extract extra power from the wrapr dot-arrow pipe %.>%. Piping Piping is … Continue reading Reusable Pipelines in R

## Sharing Modeling Pipelines in R

Reusable modeling pipelines are a practical idea that gets re-developed many times in many contexts. wrapr supplies a particularly powerful pipeline notation, and a pipe-stage re-use system (notes here). We will demonstrate this with the vtreat data preparation system. Our example task is to fit a model on some arbitrary data. Our model will try … Continue reading Sharing Modeling Pipelines in R

## Quoting in R

Many R users appear to be big fans of “code capturing” or “non standard evaluation” (NSE) interfaces. In this note we will discuss quoting and non-quoting interfaces in R. The above terms are simply talking about interfaces where a name to be used is captured from the source code the user typed, and thus does … Continue reading Quoting in R

## More on Bias Corrected Standard Deviation Estimates

This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments. For normal deviates there is, of course, a well know scaling correction that returns an unbiased estimate for observed standard deviations. It (from the same source): … provides an example where imposing the … Continue reading More on Bias Corrected Standard Deviation Estimates

## How to de-Bias Standard Deviation Estimates

This note is about attempting to remove the bias brought in by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is not important to remove it for reasonable sized samples, and (despite that) give a very complete bias … Continue reading How to de-Bias Standard Deviation Estimates