Andy Warhol said “In the future, everyone will be world-famous for 15 minutes.” Here’s my 15 seconds of fame, a soundbite from the IBM Insight conference last year. My comments start at 1:30. In a nutshell, I predict that data analyt...

What’s the connection between hypergeometric distributions, hypergeometric functions, and hypergeometric series? The hypergeometric distribution is a probability distribution with parameters N, M, and n. Suppose you have an urn containing N balls, M red and the remaining N – M blue, and you select n balls at a time. The hypergeometric distribution gives the probability of selecting k red balls. The probability generating function […]
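
To make the urn description concrete, here is a minimal sketch of the probability of drawing exactly k red balls. It assumes the n balls are drawn without replacement, and the function name `hypergeom_pmf` is mine, not something from the post.

```python
from math import comb


def hypergeom_pmf(k, N, M, n):
    """Probability of exactly k red balls when n balls are drawn
    without replacement from an urn of N balls, M of which are red."""
    # Outside this range the event is impossible.
    if k < max(0, n - (N - M)) or k > min(n, M):
        return 0.0
    # Ways to pick k red and n - k blue, over all ways to pick n balls.
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


if __name__ == "__main__":
    N, M, n = 10, 4, 3
    for k in range(n + 1):
        print(k, hypergeom_pmf(k, N, M, n))
```

Summing the probabilities over every feasible k should give 1, which is a quick sanity check on the formula.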

“Reproducible” and “randomized” don’t seem to go together. If something was unpredictable the first time, shouldn’t it be unpredictable if you start over and run it again? As is often the case, we want incompatible things. But the combination of reproducible and random can be reconciled. Why would we want a randomized controlled trial (RCT) to […]
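
One common way to get both properties, which may or may not be what the full post describes, is a pseudorandom generator with a recorded seed. The sketch below is an assumption on my part: `assign_arms` and the arm labels are hypothetical names, and simple independent assignment stands in for whatever scheme a real trial would use.

```python
import random


def assign_arms(n_subjects, seed, arms=("A", "B")):
    """Assign each subject to an arm using a seeded generator.

    The same seed always reproduces the same assignment list, while the
    sequence still looks unpredictable to anyone who lacks the seed.
    """
    rng = random.Random(seed)  # private generator; leaves global state alone
    return [rng.choice(arms) for _ in range(n_subjects)]


if __name__ == "__main__":
    first = assign_arms(10, seed=2024)
    rerun = assign_arms(10, seed=2024)
    print(first)
    print(first == rerun)  # rerunning with the same seed gives the same list
```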

Long run or broken software? I got a call one time to take a look at randomization software that wasn’t randomizing. My first thought was that the software was working as designed, and that the users were just seeing a long run. Long sequences of the same assignment are more likely than you think. You […]
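
A small simulation is a quick way to see how long runs get. This is a rough sketch that assumes two arms assigned independently with probability 1/2 each; the function names are hypothetical.

```python
import random


def longest_run(seq):
    """Length of the longest block of identical consecutive items."""
    best = cur = 0
    prev = object()  # sentinel that equals nothing in seq
    for x in seq:
        cur = cur + 1 if x == prev else 1
        best = max(best, cur)
        prev = x
    return best


def simulate_longest_runs(n_assignments, n_trials, seed=0):
    """Longest run in each of n_trials simulated assignment sequences."""
    rng = random.Random(seed)
    return [
        longest_run([rng.random() < 0.5 for _ in range(n_assignments)])
        for _ in range(n_trials)
    ]


if __name__ == "__main__":
    runs = simulate_longest_runs(n_assignments=100, n_trials=1000)
    print("average longest run:", sum(runs) / len(runs))
```

For fair assignments the longest run grows roughly like the base-2 logarithm of the sequence length, so a run of six or seven identical assignments among 100 is unremarkable.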

Statisticians use n to denote the number of subjects in a data set and p to denote nearly everything else. You’re supposed to know from context what each p means. In the phrase “big n, little p” the symbol p means the number of measurements per subject. Traditional data sets are “big n, little p” […]

Suppose project completion time follows a Pareto (power law) distribution with parameter α. That is, for t > 1, the probability that completion time is bigger than t is t^(-α). (We start out time at t = 1 because that makes the calculations a little simpler.) Now suppose we know that a project has lasted […]
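
Here is a sketch of what the setup implies once a project is known to have lasted until some time s. The excerpt is cut off, so this is not necessarily the calculation the post goes on to make. The symbol s and the function names are my own. Under the stated survival function, the conditional survival probability is (t/s)^(-α), and for α > 1 the expected remaining time works out to s/(α – 1).

```python
def survival(t, alpha):
    """P(T > t) for the Pareto model above: t**(-alpha) for t > 1."""
    return 1.0 if t <= 1 else t ** (-alpha)


def conditional_survival(t, s, alpha):
    """P(T > t | T > s) for t >= s >= 1, which simplifies to (t/s)**(-alpha)."""
    return survival(t, alpha) / survival(s, alpha)


def expected_remaining(s, alpha):
    """E[T - s | T > s] = s / (alpha - 1); the mean is finite only if alpha > 1."""
    if alpha <= 1:
        raise ValueError("expected remaining time is infinite for alpha <= 1")
    return s / (alpha - 1)


if __name__ == "__main__":
    alpha = 2.0
    for s in (1, 2, 5, 10):
        print(s, expected_remaining(s, alpha))
```

The expected remaining time grows in proportion to the time already elapsed, which is the opposite of what intuition about nearly finished projects suggests.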

This is what the book Social Media Mining calls the Big Data Paradox: Social media data is undoubtedly big. However, when we zoom into individuals for whom, for example, we would like to make relevant recommendations, we often have little data for each specific individual. We have to exploit the characteristics of social media and […]

Suppose you have data from a discrete power law with exponent α. That is, the probability of an outcome n is proportional to n^(-α). How can you recover α? A naive approach would be to gloss over the fact that you have discrete data and use the MLE (maximum likelihood estimator) for continuous data. That […]
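
To compare the two approaches, the sketch below puts the naive continuous estimator next to a direct numerical maximization of the discrete likelihood. Several things here are my assumptions rather than the post's: outcomes start at n = 1, the normalizing constant is the Riemann zeta function ζ(α), the zeta approximation is hand-rolled to stay within the standard library, and the maximizer is a plain ternary search.

```python
import math
import random


def zeta(alpha, terms=1000):
    """Riemann zeta for alpha > 1: partial sum plus an
    Euler-Maclaurin correction for the tail."""
    head = sum(k ** -alpha for k in range(1, terms))
    tail = terms ** (1 - alpha) / (alpha - 1) + 0.5 * terms ** -alpha
    return head + tail


def continuous_mle(data):
    """Naive estimate that treats the data as continuous with minimum 1.
    Fails with ZeroDivisionError if every observation equals 1."""
    return 1 + len(data) / sum(math.log(x) for x in data)


def discrete_mle(data, lo=1.01, hi=10.0, iters=60):
    """Maximize the discrete log-likelihood over alpha by ternary search.
    The log-likelihood is concave in alpha, so the search is safe."""
    n = len(data)
    log_sum = sum(math.log(x) for x in data)

    def loglik(a):
        return -a * log_sum - n * math.log(zeta(a))

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2


def sample_zeta(alpha, size, rng, cap=10**6):
    """Draw from the discrete power law by inverting the CDF term by term.
    Values are truncated at cap to guard against floating point stalls."""
    z = zeta(alpha)
    out = []
    for _ in range(size):
        u, k, cdf = rng.random(), 1, 1 / z
        while cdf < u and k < cap:
            k += 1
            cdf += k ** -alpha / z
        out.append(k)
    return out


if __name__ == "__main__":
    data = sample_zeta(2.5, 5000, random.Random(1))
    print("continuous MLE:", continuous_mle(data))
    print("discrete MLE:  ", discrete_mle(data))
```

On simulated data like this, the two estimates can differ substantially, because most observations equal 1 and contribute nothing to the sum of logarithms in the continuous formula.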

The article Deming, data and observational studies by S. Stanley Young and Alan Karr opens with “Any claim coming from an observational study is most likely to be wrong.” They back up this assertion with data about observational studies later contradicted by prospective studies. Much has been said lately about the assertion that most published results are false, particularly […]

A/B testing, or split testing, is commonly used in web marketing to decide which of two design options performs better. If you have so many visitors to a site that the number of visitors used in a test is negligible, conventional randomization schemes are the way to go. They’re simple and effective. But if you […]