## Can a classifier that never says “yes” be useful?

March 8, 2014
Many data science projects and presentations are needlessly derailed by a failure to set shared, business-relevant, quantitative expectations early on (for some advice see Setting expectations in data science projects). One of the most common issues is the layman's expectation of "perfect prediction" from classification projects. It is important to set expectations correctly so […]

## Useful Functions in R for Manipulating Text Data

Introduction: In my current job, I study HIV at the genetic and biochemical levels. Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text. (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from […]

## Bad Bayes: an example of why you need hold-out testing

February 1, 2014
We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit. The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams […]

## Video Tutorial: Breaking Down the Definition of the Hazard Function

The hazard function is a fundamental quantity in survival analysis. For an event occurring at some time on a continuous time scale, the hazard function, $h(t)$, for that event is defined as $h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}$, where $t$ is the time and $T$ is the time of the occurrence of the event. However, what does this actually mean? In this YouTube […]
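As a rough numerical illustration of the standard definition (the conditional probability that the event occurs in a short window after time t, given that it has not occurred before t, divided by the window length), the sketch below approximates the hazard of an exponentially distributed event time. The function name `exponential_hazard`, the choice of the exponential distribution, and the window size `dt` are assumptions made for this example and do not come from the video.

```python
import math


def exponential_hazard(t, rate, dt=1e-6):
    """Approximate P(t <= T < t + dt | T >= t) / dt for T ~ Exponential(rate)."""

    def survival(x):
        # S(x) = P(T >= x) for an exponential event time
        return math.exp(-rate * x)

    # Probability that the event falls in the window [t, t + dt)
    prob_event_in_window = survival(t) - survival(t + dt)

    # Condition on survival up to t, then scale by the window length
    return prob_event_in_window / (survival(t) * dt)


# The exponential distribution has a constant hazard equal to its rate,
# so these two values should both be close to 0.5.
hazard_early = exponential_hazard(0.5, rate=0.5)
hazard_late = exponential_hazard(5.0, rate=0.5)
```

A finite `dt` only approximates the limit in the definition, so the result differs slightly from the exact rate.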

## Coursera Specializations: Data Science, Systems Biology, Python Programming

January 22, 2014
I first mentioned Coursera about a year ago, when I hired a new analyst in my core. This new hire came in as a very competent Python programmer with a molecular biology and microbial ecology background, but with very little experience in statistics. I ...

## Rectangular Integration (a.k.a. The Midpoint Rule) – Conceptual Foundations and a Statistical Application in R

Introduction: Continuing the recently started series on numerical integration, this post will introduce rectangular integration. I will describe the concept behind rectangular integration, show a function in R for how to do it, and use it to check that the distribution actually integrates to 1 over its support set. This post follows from my […]
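The idea can be sketched in a few lines. This is a minimal Python illustration, not the R function from the post: the name `midpoint_integrate`, the number of rectangles, and the standard normal density used as the test case are all assumptions, since the excerpt does not name the distribution it checks. The infinite support is truncated to [-8, 8], where the remaining tail mass is negligible.

```python
import math


def midpoint_integrate(f, lower, upper, n_rectangles=1000):
    """Approximate the integral of f over [lower, upper] with the midpoint rule."""
    width = (upper - lower) / n_rectangles
    total = 0.0
    for i in range(n_rectangles):
        # Height of each rectangle is f evaluated at the middle of its base
        midpoint = lower + (i + 0.5) * width
        total += f(midpoint) * width
    return total


def standard_normal_pdf(x):
    # Density of the standard normal distribution (assumed example)
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# Area under the density over a range that covers almost all of its mass
area = midpoint_integrate(standard_normal_pdf, -8.0, 8.0)
```

With enough rectangles over a wide enough range, `area` should come out very close to 1, which is the kind of sanity check the post describes.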

January 19, 2014
Nassim Nicholas Taleb recently wrote an article advocating abandoning standard deviation in favor of mean absolute deviation. Mean absolute deviation is indeed an interesting and useful measure, but there is a reason that standard deviation is important even if you do not like it: it prefers models that […]

## How To Install BioPerl Without Root Privileges

January 13, 2014
I've seen this question asked and partially answered all around the web. As with anything related to Perl, I'm sure there is more than one way to do it. Here's how I do it with Perl 5.10.1 on CentOS 6.4. First, install local::lib with bootstra...

## The Extra Step: Graphs for Communication versus Exploration

January 12, 2014
Visualization is a useful tool for data exploration and statistical analysis, and it's an important method for communicating your discoveries to others. While those two uses of visualization are related, they aren't identical. One of the reasons that I like ggplot so much is that it excels at layering together multiple views and summaries of […]

## Generalized linear models for predicting rates

January 1, 2014
I often need to build a predictive model that estimates rates. The example of our age is ad click-through rates (how often a viewer clicks on an ad, estimated as a function of the features of the ad and the viewer). Another timely example is estimating default rates of mortgages or credit cards. You […]