Blog Archives

Trends and Opportunities in Data Analysis

February 11, 2016

Andy Warhol said “In the future, everyone will be world-famous for 15 minutes.” Here’s my 15 seconds of fame, a soundbite from the IBM Insight conference last year. My comments start at 1:30. In a nutshell, I predict that data analyt...

Connection between hypergeometric distribution and series

February 8, 2016

What’s the connection between the hypergeometric distribution, hypergeometric functions, and hypergeometric series? The hypergeometric distribution is a probability distribution with parameters N, M, and n. Suppose you have an urn containing N balls, M red and the remaining N – M blue, and you select n balls at a time. The hypergeometric distribution gives the probability of selecting k red balls. The probability generating function […]
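
As a rough illustration of the urn description above, the probability of drawing exactly k red balls can be computed by counting combinations. This is only a sketch: the function name `hypergeom_pmf` and the example values N = 20, M = 7, n = 5 are made up for illustration and do not come from the post.

```python
from math import comb

def hypergeom_pmf(k, N, M, n):
    """Probability of exactly k red balls when n balls are drawn without
    replacement from an urn of N balls, M of which are red."""
    # k must leave enough blue balls to fill the rest of the sample
    if k < max(0, n - (N - M)) or k > min(n, M):
        return 0.0
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

# Hypothetical example values, chosen only for illustration
N, M, n = 20, 7, 5
probs = [hypergeom_pmf(k, N, M, n) for k in range(n + 1)]
for k, p in enumerate(probs):
    print(k, p)
```

The probabilities over k = 0, …, n sum to 1, and their mean is nM/N, which makes a quick sanity check.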

Reproducible randomized controlled trials

February 1, 2016

“Reproducible” and “randomized” don’t seem to go together. If something was unpredictable the first time, shouldn’t it be unpredictable if you start over and run it again? As is often the case, we want incompatible things. But the combination of reproducible and random can be reconciled. Why would we want a randomized controlled trial (RCT) to […]
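
The excerpt is cut off before it says how the two are reconciled, so the following is only an assumption about one common approach: drive the randomization from a pseudorandom generator with a recorded seed. The function name `randomize`, the arm labels "A" and "B", and the seed value are all hypothetical.

```python
import random

def randomize(n_subjects, seed):
    """Assign each subject to arm "A" or "B" using a generator that is
    fully determined by the seed (hypothetical helper for illustration)."""
    rng = random.Random(seed)  # private generator; does not touch global state
    return [rng.choice("AB") for _ in range(n_subjects)]

# Same seed, same assignments: the sequence looks random but can be replayed.
first = randomize(10, seed=20160201)
second = randomize(10, seed=20160201)
print(first)
print(first == second)
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps other code from disturbing the sequence.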

Random number generator seed mistakes

January 29, 2016

Long run or broken software? I got a call one time to take a look at randomization software that wasn’t randomizing. My first thought was that the software was working as designed, and that the users were just seeing a long run. Long sequences of the same assignment are more likely than you think. You […]
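
To get a feel for how likely long runs are, here is a small sketch that computes the exact probability of seeing at least one run of r identical assignments among n assignments. It assumes two arms chosen independently with probability 1/2 each, which may not match the software described in the post; the function name `prob_run_at_least` is made up for illustration.

```python
def prob_run_at_least(n, r):
    """Exact probability that n fair two-arm assignments (n >= 1) contain
    a run of at least r identical assignments in a row."""
    if r <= 1:
        return 1.0
    # counts[j] = number of sequences so far with every run shorter than r
    # whose final run has length j + 1
    counts = [0] * (r - 1)
    counts[0] = 2  # the two sequences of length 1
    for _ in range(n - 1):
        new = [0] * (r - 1)
        new[0] = sum(counts)      # switch arms: final run restarts at length 1
        for j in range(1, r - 1):
            new[j] = counts[j - 1]  # repeat the arm: final run grows by one
        counts = new
    return 1 - sum(counts) / 2 ** n

# Hypothetical example: 20 assignments, runs of 5 or more
print(prob_run_at_least(20, 5))
```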

Big p, Little n

January 7, 2016

Statisticians use n to denote the number of subjects in a data set and p to denote nearly everything else. You’re supposed to know from context what each p means. In the phrase “big n, little p” the symbol p means the number of measurements per subject. Traditional data sets are “big n, little p” […]

The longer it has taken, the longer it will take

December 21, 2015

Suppose project completion time follows a Pareto (power law) distribution with parameter α. That is, for t > 1, the probability that completion time is bigger than t is t^(-α). (We start out time at t = 1 because that makes the calculations a little simpler.) Now suppose we know that a project has lasted […]
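
The excerpt stops before the calculation, so the sketch below only works out what follows from the survival function stated above, P(T > t) = t^(-α) for t > 1, and is not necessarily the post's own derivation. Conditioning on the project having already lasted t0 gives P(T > s | T > t0) = (s/t0)^(-α) for s ≥ t0, and for α > 1 the expected additional time is t0/(α - 1). The function names and example numbers are made up for illustration.

```python
def survival(t, alpha):
    """P(T > t) for the Pareto distribution that starts at t = 1."""
    if t <= 1:
        return 1.0
    return t ** (-alpha)

def conditional_survival(s, t0, alpha):
    """P(T > s | T > t0) for s >= t0 >= 1; equals (s / t0) ** (-alpha)."""
    return survival(s, alpha) / survival(t0, alpha)

def expected_remaining(t0, alpha):
    """E[T - t0 | T > t0] = t0 / (alpha - 1); finite only when alpha > 1."""
    if alpha <= 1:
        raise ValueError("expected remaining time is infinite for alpha <= 1")
    return t0 / (alpha - 1)

# Hypothetical example with alpha = 2: the expected time still to go
# grows in proportion to the time already spent.
for t0 in (2, 5, 10):
    print(t0, expected_remaining(t0, alpha=2))
```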

Big data paradox

December 14, 2015

This is what the book Social Media Mining calls the Big Data Paradox: Social media data is undoubtedly big. However, when we zoom into individuals for whom, for example, we would like to make relevant recommendations, we often have little data for each specific individual. We have to exploit the characteristics of social media and […]

Estimating the exponent of discrete power law data

November 24, 2015

Suppose you have data from a discrete power law with exponent α. That is, the probability of an outcome n is proportional to n^(-α). How can you recover α? A naive approach would be to gloss over the fact that you have discrete data and use the MLE (maximum likelihood estimator) for continuous data. That […]
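
The excerpt ends before saying how the naive approach fares, so the sketch below only puts the two estimators side by side without claiming it matches the post's method. It assumes outcomes n = 1, 2, 3, …, so the normalizing constant is the zeta function ζ(α), and it assumes a lower cutoff of 1 for the continuous formula. The zeta approximation, the ternary search, and the sample data are all illustrative choices.

```python
from math import log

def zeta(alpha, terms=1000):
    """Approximate the Riemann zeta function for alpha > 1 with a partial
    sum plus an Euler-Maclaurin correction for the tail."""
    head = sum(k ** (-alpha) for k in range(1, terms))
    tail = terms ** (1 - alpha) / (alpha - 1) + 0.5 * terms ** (-alpha)
    return head + tail

def log_likelihood(alpha, data):
    """Log-likelihood when P(n) = n^(-alpha) / zeta(alpha), n = 1, 2, ..."""
    return -alpha * sum(log(x) for x in data) - len(data) * log(zeta(alpha))

def discrete_mle(data, lo=1.01, hi=10.0, iterations=60):
    """Maximize the log-likelihood by ternary search; the log-likelihood
    is concave in alpha, so the search converges to the maximum."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, data) < log_likelihood(m2, data):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

def continuous_mle(data, x_min=1.0):
    """Naive estimate: the MLE for a continuous power law on x >= x_min."""
    return 1 + len(data) / sum(log(x / x_min) for x in data)

# Hypothetical sample, invented for illustration only
data = [1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 2, 1, 1, 5, 1, 2, 1, 1, 1, 4]
alpha_discrete = discrete_mle(data)
alpha_naive = continuous_mle(data)
print(alpha_discrete, alpha_naive)
```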

Skin in the game for observational studies

November 4, 2015

The article “Deming, data and observational studies” by S. Stanley Young and Alan Karr opens with: “Any claim coming from an observational study is most likely to be wrong.” They back up this assertion with data about observational studies later contradicted by prospective studies. Much has been said lately about the assertion that most published results are false, particularly […]

Balancing profit and learning in A/B testing

October 28, 2015

A/B testing, or split testing, is commonly used in web marketing to decide which of two design options performs better. If you have so many visitors to a site that the number of visitors used in a test is negligible, conventional randomization schemes are the way to go. They’re simple and effective. But if you […]


