# Posts Tagged ‘Tutorials’

## Video Tutorial – Rolling 2 Dice: An Intuitive Explanation of The Central Limit Theorem


According to the central limit theorem, if $n$ random variables, $X_1, X_2, \dots, X_n$, are independent and identically distributed, and $n$ is sufficiently large, then the distribution of their sample mean, $\bar{X}_n$, is approximately normal, and this approximation improves as $n$ increases. One of the most remarkable aspects of the central limit theorem (CLT) is its validity for any parent distribution of […]
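The two-dice case in the title can be checked by brute force. The following is a minimal sketch, not the video's own method: it enumerates every outcome of rolling a given number of fair six-sided dice and tabulates the exact distribution of the sum. The function name `dice_sum_distribution` is a hypothetical helper chosen for illustration.

```python
from collections import Counter
from itertools import product


def dice_sum_distribution(n_dice):
    """Exact distribution of the sum of n_dice fair six-sided dice."""
    # Enumerate all 6**n_dice equally likely outcomes and count each sum.
    counts = Counter(sum(roll) for roll in product(range(1, 7), repeat=n_dice))
    total = 6 ** n_dice
    return {s: c / total for s, c in sorted(counts.items())}


one_die = dice_sum_distribution(1)   # flat: every face has probability 1/6
two_dice = dice_sum_distribution(2)  # triangular, peaked at a sum of 7

# Crude text histogram of the two-dice distribution.
for s, p in two_dice.items():
    print(f"{s:2d}  {p:.3f}  {'#' * round(p * 72)}")
```

A single die has a flat distribution, yet the sum of just two dice is already peaked in the middle. Calling `dice_sum_distribution` with a larger argument gives a shape that looks progressively more bell-like.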

## Side-by-Side Box Plots with Patterns From Data Sets Stacked by reshape2 and melt() in R

Introduction: A while ago, one of my co-workers asked me to group box plots by plotting them side-by-side within each group, and he wanted to use patterns rather than colours to distinguish between the box plots within a group; the publication that will display his plots prints in black-and-white only. I gladly investigated how to […]

## Video Tutorial – The Hazard Function is the Probability Density Function Divided by the Survival Function


In an earlier video, I introduced the definition of the hazard function and broke it down into its mathematical components. Recall that the definition of the hazard function for events defined on a continuous time scale is $h(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t \mid T > t)}{\Delta t}$, where $T$ is the time of the occurrence of the event. Did you know that the hazard function can be expressed as the probability density function (PDF) divided by the […]
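The identity in the title follows in a few lines from the limit definition of the hazard function. The derivation below is a sketch under assumed notation, which may differ from the video's: $T$ is the event time, $F(t) = P(T \le t)$ is its cumulative distribution function, $f(t) = F'(t)$ is its PDF, and $S(t) = P(T > t) = 1 - F(t)$ is the survival function, with $S(t) > 0$.

```latex
\begin{align*}
h(t) &= \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t \mid T > t)}{\Delta t} \\
     &= \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t)}{\Delta t \, P(T > t)} \\
     &= \frac{1}{S(t)} \lim_{\Delta t \to 0} \frac{F(t + \Delta t) - F(t)}{\Delta t} \\
     &= \frac{f(t)}{S(t)}
\end{align*}
```

The second line uses the definition of conditional probability, together with the fact that the event $\{t < T \le t + \Delta t\}$ is contained in $\{T > t\}$. The last line recognizes the limit as the derivative of $F$.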

## Software Carpentry at UVA, Redux

March 12, 2014

Software Carpentry is an international collaboration, backed by Mozilla and the Sloan Foundation, comprising a team of volunteers who teach computational competence and basic programming skills to scientists. In addition to a suite of online lessons, ...

## Less wordy R

March 11, 2014

The Swarm Lab presents a nice comparison of R and Python code for a simple (read ‘one could do it in Excel’) problem. The example works, but I was surprised by how wordy the R code was and decided to check if one could easily produce a shorter version. The beginning is pretty much the […]

## Can a classifier that never says “yes” be useful?

March 8, 2014

Many data science projects and presentations are needlessly derailed by a failure to set shared, business-relevant, quantitative expectations early on (for some advice, see Setting expectations in data science projects). One of the most common issues is the layperson's expectation of “perfect prediction” from classification projects. It is important to set expectations correctly so […]

Related posts: Setting expectations in data science projects; More on ROC/AUC; On Being a…
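The excerpt is cut off before the post's own argument, so the following is only a loose illustration of why a single headline metric can mislead when setting expectations. It assumes hypothetical data with a 1% positive rate and a trivial `always_no` classifier; neither comes from the post.

```python
def always_no(_features):
    """A trivial classifier that predicts the negative class for every input."""
    return 0


# Hypothetical data: 1,000 cases, 10 of them positive (a 1% base rate).
labels = [1] * 10 + [0] * 990
predictions = [always_no(None) for _ in labels]

correct = sum(p == y for p, y in zip(predictions, labels))
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

accuracy = correct / len(labels)       # fraction of all cases labelled correctly
recall = true_positives / sum(labels)  # fraction of positives that were found

print(f"accuracy = {accuracy:.3f}")
print(f"recall   = {recall:.3f}")
```

Under these assumptions the classifier is right 99% of the time while finding none of the positive cases, which is one reason accuracy alone is a poor basis for agreeing on expectations.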

## Useful Functions in R for Manipulating Text Data


Introduction: In my current job, I study HIV at the genetic and biochemical levels. Thus, I often work with data involving the sequences of nucleotides or amino acids of various patient samples of HIV, and this type of work involves a lot of manipulating text. (Strictly speaking, I analyze sequences of nucleotides from DNA that are reverse-transcribed from […]

## Bad Bayes: an example of why you need hold-out testing

February 1, 2014

We demonstrate a dataset that causes many good machine learning algorithms to overfit horribly. The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams […]

Related posts: Don’t use correlation to track prediction performance; Generalized linear models for predicting…
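The post's dataset and models are not shown in this excerpt. As a simplified stand-in (a lookup-table "model", not Naive Bayes or any algorithm from the post), the sketch below assumes hypothetical data in which every row carries one unique token and a label that is pure noise. Memorizing the tokens looks perfect on the training rows and is no better than guessing on held-out rows.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_rows(n, start_id):
    """Rows of (unique token, random 0/1 label): the token carries no signal."""
    return [(f"token_{start_id + i}", rng.randint(0, 1)) for i in range(n)]


train = make_rows(1000, 0)
holdout = make_rows(1000, 1000)  # tokens never seen during training

# "Model": memorize each training token's label; fall back to the majority
# training label for any token that was not seen.
memory = dict(train)
majority = Counter(label for _, label in train).most_common(1)[0][0]


def predict(token):
    return memory.get(token, majority)


def accuracy(rows):
    return sum(predict(tok) == y for tok, y in rows) / len(rows)


train_acc = accuracy(train)
holdout_acc = accuracy(holdout)
print(f"training accuracy = {train_acc:.3f}")
print(f"hold-out accuracy = {holdout_acc:.3f}")
```

Training accuracy is 1.0 by construction, while hold-out accuracy should land near 0.5 because the labels are coin flips. Only the held-out rows reveal that nothing was learned.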

## Video Tutorial: Breaking Down the Definition of the Hazard Function


The hazard function is a fundamental quantity in survival analysis. For an event occurring at some time on a continuous time scale, the hazard function, $h(t)$, for that event is defined as $h(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t \mid T > t)}{\Delta t}$, where $t$ is the time and $T$ is the time of the occurrence of the event. However, what does this actually mean? In this YouTube […]

## Coursera Specializations: Data Science, Systems Biology, Python Programming

January 22, 2014

I first mentioned Coursera about a year ago, when I hired a new analyst in my core. This new hire came in as a very competent Python programmer with a molecular biology and microbial ecology background, but with very little experience in statistics. I ...