# Posts Tagged ‘Tutorials’

## A comment on preparing data for classifiers

December 4, 2014

I have been working through (with some honest appreciation) a recent article comparing many classifiers on many data sets: “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” by Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim; Journal of Machine Learning Research 15(Oct):3133−3181, 2014 (which we will call “the DWN paper” in this note). This paper applies 179 …

## Performing Logistic Regression in R and SAS

Introduction: My statistics education focused a lot on normal linear least-squares regression, and a professor in an introductory statistics class even told me that 95% of statistical consulting can be done with knowledge learned up to and including a course in linear regression. Unfortunately, that advice has turned out to vastly underestimate the […]

## RNA-seq Data Analysis Course Materials

November 20, 2014

Last week I ran a one-day workshop on RNA-seq data analysis in the UVA Health Sciences Library. I set up an AWS public EC2 image with all the necessary software installed. Participants logged into AWS, launched the image, and we kicked off the morning ...

## Can we try to make an adjustment?

November 14, 2014

In most of our data science teaching (including our book Practical Data Science with R) we emphasize the deliberately easy problem of “exchangeable prediction.” We define exchangeable prediction as: given a series of observations with two distinguished classes of variables/observations denoted “x”s (denoting control variables, independent variables, experimental variables, or predictor variables) and “y” (denoting …

## Bias/variance tradeoff as gamesmanship

October 30, 2014

Continuing our series of reading out loud from a single page of a statistics book, we look at page 224 of the 1972 Dover edition of Leonard J. Savage’s “The Foundations of Statistics.” On this page we are treated to an example attributed to Leo A. Goodman in 1953 that illustrates how for normally distributed …

## Calculating the sum or mean of a numeric (continuous) variable by a group (categorical) variable in SAS

Introduction: A common task in data analysis and statistics is to calculate the sum or mean of a continuous variable. If that variable can be categorized into 2 or more classes, you may want to get the sum or mean for each class. This sounds like a simple task, yet I took a surprisingly long time […]
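The post itself works in SAS, and its code is not shown in this excerpt. As a language-neutral illustration of the task it describes, here is a minimal Python sketch that computes the sum and mean of a numeric variable within each class of a categorical variable. The function name `sum_and_mean_by_group` and the sample data are hypothetical and do not come from the original post.

```python
from collections import defaultdict


def sum_and_mean_by_group(rows):
    """Return {group: (sum, mean)} for an iterable of (group, value) pairs."""
    totals = defaultdict(float)  # running sum of values per group
    counts = defaultdict(int)    # number of observations per group
    for group, value in rows:
        totals[group] += value
        counts[group] += 1
    # The mean of each group is its sum divided by its observation count.
    return {g: (totals[g], totals[g] / counts[g]) for g in totals}


# Hypothetical data: (categorical variable, continuous variable)
rows = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0), ("b", 8.0)]
result = sum_and_mean_by_group(rows)
for group, (total, mean) in sorted(result.items()):
    print(group, total, mean)
```

Keeping a separate count per group means the data needs only one pass and does not have to be sorted by group first.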

## Reading the Gauss-Markov theorem

August 26, 2014

What is the Gauss-Markov theorem? From “The Cambridge Dictionary of Statistics,” B. S. Everitt, 2nd Edition: A theorem that proves that if the error terms in a multiple regression have the same variance and are uncorrelated, then the estimators of the parameters in the model produced by least squares estimation are better (in the sense …

## The Chi-Squared Test of Independence – An Example in Both R and SAS

Introduction: The chi-squared test of independence is one of the most basic and common hypothesis tests in the statistical analysis of categorical data. Given 2 categorical random variables, the chi-squared test of independence determines whether or not there exists a statistical dependence between them. Formally, it is a hypothesis test with the following null and […]
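The post's own examples are in R and SAS and are not reproduced in this excerpt. To make the idea concrete, here is a minimal Python sketch of the standard Pearson chi-squared statistic for a contingency table, where each expected count is (row total × column total) / grand total. The function name and the 2×2 table are hypothetical. The sketch returns only the statistic and its degrees of freedom; turning these into a p-value requires the chi-squared distribution, which is left out here.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom.

    `table` is a list of rows of observed counts (a contingency table).
    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical 2x2 table of observed counts.
table = [[10, 20], [20, 10]]
stat, df = chi_squared_independence(table)
print(stat, df)
```

A larger statistic relative to the chi-squared distribution with `df` degrees of freedom is stronger evidence against independence.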

## Do your "data janitor work" like a boss with dplyr

August 20, 2014

Data “janitor-work”: The New York Times recently ran a piece on wrangling and cleaning data: “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.” Whether you call it “janitor-work,” wrangling/munging, cleaning/cleansing/scru...