Data “janitor-work”

The New York Times recently ran a piece on wrangling and cleaning data: “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.” Whether you call it “janitor-work,” wrangling/munging, cleaning/cleansing/scru...

In an earlier video, I showed how to calculate expected counts in a contingency table using marginal proportions and totals. (Recall that expected counts are needed to conduct hypothesis tests of independence between categorical random variables.) Today, I want to share a second video on calculating expected counts, this time using joint probabilities. This method uses […]
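
Because the excerpt above is cut off, the following is only a sketch of the standard textbook calculation, not necessarily the presentation used in the video. The symbols are assumptions introduced here: $R_i$ is the total of row $i$, $C_j$ is the total of column $j$, and $n$ is the grand total. Under the null hypothesis of independence, the joint probability of cell $(i, j)$ is the product of the two marginal probabilities, which are estimated by the marginal proportions:

```latex
\hat{p}_{ij} = \hat{p}_{i\cdot}\,\hat{p}_{\cdot j}
             = \frac{R_i}{n}\cdot\frac{C_j}{n},
\qquad
E_{ij} = n\,\hat{p}_{ij} = \frac{R_i\,C_j}{n}.
```

The last equality shows that the joint-probability route and the marginal-totals route give the same expected count for every cell.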

The Hardy-Weinberg law is a fundamental principle in statistical genetics. If its 7 assumptions are fulfilled, then it predicts that the allelic frequency of a genetic trait will remain constant from generation to generation. In this new video tutorial on my YouTube channel, I explain the math behind the Hardy-Weinberg theorem. In particular, I clarify […]
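
The constancy that the law predicts can be checked numerically. The sketch below is not taken from the video. It assumes a single locus with two alleles, labelled `A` and `a` here for illustration, and assumes that the law's conditions hold so that genotypes form by random pairing of alleles. The function name `next_generation` is hypothetical.

```python
def next_generation(p):
    """Return genotype frequencies and the next-generation frequency of allele A.

    Assumes one locus with two alleles (A with frequency p, a with
    frequency q = 1 - p) and that the Hardy-Weinberg assumptions hold.
    """
    q = 1.0 - p
    # Random pairing of alleles gives the proportions p^2 : 2pq : q^2.
    genotypes = {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}
    # Every AA individual passes on A; an Aa individual passes on A half the time.
    p_next = genotypes["AA"] + 0.5 * genotypes["Aa"]
    return genotypes, p_next


# Illustrative starting frequency, not data from the post.
genotypes, p_next = next_generation(0.3)
print(genotypes)
print(p_next)
```

Algebraically, `p_next` is p^2 + pq = p(p + q) = p, so the allele frequency comes back unchanged, which is the sense in which it stays constant across generations.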

Page 94 of Gelman, Carlin, Stern, Dunson, Vehtari, Rubin “Bayesian Data Analysis” 3rd Edition (which we will call BDA3) provides a great example of what happens when common broad frequentist bias criticisms are over-applied to predictions from ordinary linear regression: the predictions appear to fall apart. BDA3 goes on to exhibit what might be considered […]

A common task in statistics and biostatistics is performing hypothesis tests of independence between two categorical random variables. The data for such tests are best organized in contingency tables, which allow expected counts to be calculated easily. In this video tutorial on my YouTube channel, I demonstrate how to calculate expected counts using marginal proportions […]
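
As a rough illustration of that calculation, the sketch below computes expected counts from the row and column totals of a contingency table, assuming the two variables are independent. The function name `expected_counts` and the observed counts are made up for the example and do not come from the video.

```python
def expected_counts(observed):
    """Expected cell counts under independence for a table of observed counts.

    `observed` is a list of rows, each a list of non-negative counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Cell (i, j): n * (row_i / n) * (col_j / n), which simplifies to row_i * col_j / n.
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical 2 x 3 table of observed counts.
observed = [[10, 20, 30],
            [20, 20, 0]]
expected = expected_counts(observed)
for row in expected:
    print(row)
```

The expected table has the same row and column totals as the observed one, which is a quick sanity check on any hand calculation.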

Two of the most common methods of statistical inference are frequentism and Bayesianism (see Bayesian and Frequentist Approaches: Ask the Right Question for some good discussion). In both cases we are attempting to perform reliable inference of unknown quantities from related observations. And in both cases inference is made possible by introducing and reasoning over […]

A quick R mini-tip: don’t use data.matrix when you mean model.matrix. If you do so, you may lose (without noticing) a lot of your model’s explanatory power (due to poor encoding). For some modeling tasks, you end up having to prepare a special expanded data matrix before calling a given machine learning algorithm. For example […]

While following up on Nina Zumel’s excellent Trimming the Fat from glm() Models in R I got to thinking about code style in R. And I realized: you can make your code much prettier by designing more of your functions to return data.frames. That may seem needlessly heavy-weight, but it has a lot of down-stream […]