In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized linear regression, when they make … Continue reading An Ad-hoc Method for Calibrating Uncalibrated Models

# Tag: R

## Is Scholarly Use of R Use Beating SPSS Already?

by Bob Muenchen & Sean Mackinnon One of us (Muenchen) has been tracking The Popularity of Data Science Software using a variety of different approaches. One approach is to use Google Scholar to count the number of scholarly articles found … Continue reading →

## Some Details on Running xgboost

While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation). In doing that I ran into one more avoidable but strange issue in using xgboost: when … Continue reading Some Details on Running xgboost

## a non-riddle

Unless I missed a point in the last riddle from the Riddler, there is very little to say about it: Given N ocre balls, N aquamarine balls, and two urns, what is the optimal way to allocate the balls to the urns towards drawing an ocre ball with no urn being empty? Both my reasoning […]

## Common Ensemble Models can be Biased

In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved. However, when making predictions on individuals, a biased model may be preferable; biased models may … Continue reading Common Ensemble Models can be Biased

## CRAN does not validate R packages!

A friend called me the other day for advice on how to submit an R package to CRAN along with a proof his method was mathematically sound. I replied with some items of advice taken from my (limited) experience with submitting packages. And with the remark that CRAN would not validate the mathematical contents of […]

## Le Monde puzzle [#1105]

Another token game as Le Monde mathematical puzzle: Archibald and Beatrix play with a pile of n>100 tokens, sequentially picking m tokens from the pile with m being a prime number [including m=1] or a multiple of 6, the winner taking the last tokens. If Beatrix knows n and proposes to Archibald to start, what […]

## Link Functions versus Data Transforms

In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason for not regressing directly against income … Continue reading Link Functions versus Data Transforms

## Link Functions versus Data Transforms

In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason for not regressing directly against income … Continue reading Link Functions versus Data Transforms

## Programming Over lm() in R

Here is simple modeling problem in R. We want to fit a linear model where the names of the data columns carrying the outcome to predict (y), the explanatory variables (x1, x2), and per-example row weights (wt) are given to us as strings. Lets start with our example data and parameters. The point is: we … Continue reading Programming Over lm() in R

## Replicating a Linear Model

For a few of my commercial projects I have been in the seemingly strange place being asked to port a linear model from one data science system to another. Now I try to emphasize that it is better going forward to port procedures and build new models with training data. But sometimes that is not … Continue reading Replicating a Linear Model

## My Favorite data.table Feature

My favorite R data.table feature is the “by” grouping notation when combined with the := notation.

Let’s take a look at this powerful notation.

First, let’s build an example data.frame.

d <- wrapr::build_frame(

"gr…

## data.table is Much Better Than You Have Been Told

There is interest in converting relational query languages (that work both over SQL databases and on local data) into data.table commands, to take advantage of data.table‘s superior performance. Obviously if one wants to use data.table it is best to learn data.table. But if we want code that can run multiple places a translation layer may … Continue reading data.table is Much Better Than You Have Been Told

## Cohen’s D for Experimental Planning

In this note, we discuss the use of Cohen’s D for planning difference-of-mean experiments. Estimating sample size Let’s imagine you are testing a new weight loss program and comparing it so some existing weight loss regimen. You want to run an experiment to determine if the new program is more effective than the old one. … Continue reading Cohen’s D for Experimental Planning

## New Versions of R GUIs: BlueSky, JASP, jamovi

It has been only two months since I summarized my reviews of point-and-click front ends for R, and it’s already out of date! I have converted that post into a regularly-updated article and added a plot of total features, which … Continue reading →

## Le Monde puzzle [#1104]

A palindromic Le Monde mathematical puzzle: In a monetary system where all palindromic amounts between 1 and 10⁸ have a coin, find the numbers less than 10³ that cannot be paid with less than three coins. Find if 20,191,104 can be paid with two coins. Similarly, find if 11,042,019 can be paid with two or […]

## another attempt at code golf

I had another lazy weekend go at code golf, trying to code in the most condensed way the following task. Provided with a square matrix A of positive integers, keep iterating the steps take the highest square 𝑥² in A. find the smallest adjacent neighbour 𝑛 replace x² with x and n with nx until […]

## riddles on Egyptian fractions and Bernoulli factories

Two fairy different riddles on the weekend Riddler. The first one is (in fine) about Egyptian fractions: I understand the first one as Find the Egyptian fraction decomposition of 2 into 11 distinct unit fractions that maximises the smallest fraction. And which I cannot solve despite perusing this amazing webpage on Egyptian fractions and making […]

## Estimating Rates using Probability Theory: Chalk Talk

We are sharing a chalk talk rehearsal on applied probability. We use basic notions of probability theory to work through the estimation of sample size needed to reliably estimate event rates. This expands basic calculations, and then moves to the idea…

## A precursor of ABC-Gibbs

Following our arXival of ABC-Gibbs, Dennis Prangle pointed out to us a 2016 paper by Athanasios Kousathanas, Christoph Leuenberger, Jonas Helfer, Mathieu Quinodoz, Matthieu Foll, and Daniel Wegmann, Likelihood-Free Inference in High-Dimensional Model, published in Genetics, Vol. 203, 893–904 in June 2016. This paper contains a version of ABC Gibbs where parameters are sequentially simulated […]