Statisticians use n to denote the number of subjects in a data set and p to denote nearly everything else. You’re supposed to know from context what each p means.
In the phrase “big n, little p” the symbol p means the number of measurements per subject. Traditional data sets are “big n, little p” because you have far more subjects than measurements per subject. For example, maybe you measure 10 things about 1000 patients.
Big data sets, such as those coming out of bioinformatics, are often “big p, little n.” For example, maybe you measure 20,000 biomarkers on 50 patients. This turns classical statistics sideways, literally and figuratively. Literally, because a “big p, little n” data set looks like the transpose of a “big n, little p” data set.
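To make the "sideways" picture concrete, here is a small sketch in Python with NumPy. It assumes the common convention that rows are subjects and columns are measurements; the names tall, wide, and describe are made up for illustration, and the data is just random noise standing in for real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed convention: one row per subject, one column per measurement.
tall = rng.normal(size=(1000, 10))    # 10 things about 1000 patients
wide = rng.normal(size=(50, 20000))   # 20,000 biomarkers on 50 patients


def describe(x):
    """Label a 2-D data set by comparing its row and column counts."""
    n, p = x.shape
    return "big n, little p" if n > p else "big p, little n"


print(tall.shape, describe(tall))
print(wide.shape, describe(wide))
# Transposing the wide data set gives it the shape of a traditional one.
print(wide.T.shape, describe(wide.T))
```

The transpose only changes the shape, of course. It does not turn 50 patients into 20,000.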
From the vantage point of a traditional statistician, “big p, little n” data sets give you very little to work with. If n is small, it doesn’t matter how big p is. In the example above, n = 50, not a big data set. But the biologist will say “What do you mean it’s not a big data set? I’ve given you 1,000,000 measurements!”
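One way to see the traditional statistician's point, sketched here under assumptions of my own choosing rather than as anything specific to the example above: a data matrix with 50 rows has rank at most 50, no matter how many columns it has. With 20,000 columns, ordinary least squares can then fit any response exactly, even pure noise. The simulated data and the variable names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 20000
X = rng.normal(size=(n, p))   # simulated measurements, one row per subject
y = rng.normal(size=n)        # a response that is pure noise, unrelated to X

# The rank cannot exceed min(n, p), so the extra columns do not add rank.
rank = np.linalg.matrix_rank(X)

# With p > n the least squares problem is underdetermined. lstsq returns the
# minimum-norm solution, which reproduces y essentially exactly.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
max_residual = np.max(np.abs(X @ coef - y))

print("measurements:", X.size)
print("rank:", rank)
print("largest residual:", max_residual)
```

A perfect fit to noise is not evidence of anything, which is why a million measurements on 50 subjects can still feel like a small data set.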
So how do you take advantage of large p even though n is small? That’s the big question. It summarizes the research program of many people in statistics and machine learning. There’s no general answer, at least not yet, though progress is being made in specific applications.
Related post: Nomenclatural abomination
Please comment on the article here: Statistics – John D. Cook