(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)
Dan Silitonga writes:
I was wondering whether you would have any advice on building a regression model on a very small dataset. I’m in the midst of revamping the model to predict tax collections from unincorporated businesses. But I only have 27 data points, 27 years of annual data. Any advice would be much appreciated.
My reply:
This sounds tough, especially given that 27 years of annual data isn’t even 27 independent data points.
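To put a rough number on the non-independence, here is a minimal sketch. It assumes the series behaves like a first-order autoregressive (AR(1)) process, which is an illustrative assumption and not something stated in the question, and it uses the standard approximation n(1 - rho)/(1 + rho) for the effective sample size when estimating a mean. The function names and the simulated series are hypothetical stand-ins, not real tax data.

```python
import numpy as np

def lag1_autocorr(y):
    """Sample lag-1 autocorrelation of a 1-D series."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return float(np.sum(d[1:] * d[:-1]) / np.sum(d * d))

def effective_n(n, rho):
    """Approximate effective sample size for the mean of an AR(1) series."""
    return n * (1.0 - rho) / (1.0 + rho)

# Simulated stand-in for 27 years of annual data (not real tax figures).
rng = np.random.default_rng(0)
n, true_rho = 27, 0.6
y = np.zeros(n)
for t in range(1, n):
    y[t] = true_rho * y[t - 1] + rng.normal()

rho_hat = lag1_autocorr(y)
print(f"estimated lag-1 autocorrelation: {rho_hat:.2f}")
print(f"effective n at rho = 0.6: {effective_n(27, 0.6):.2f}")  # 6.75
```

Under these assumptions, a year-to-year correlation of 0.6 leaves 27 observations carrying roughly the information of 7 independent ones. With only 27 points the autocorrelation estimate is itself noisy, so this is an order-of-magnitude guide at best.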
I have various essentially orthogonal suggestions:
1 [added after seeing John Cook's comment below]. Do your best, making as many assumptions as you need. In a Bayesian context, this means that you’d use a strong and informative prior and let the data update it as appropriate. In a less formal setting, you’d start with a guess of a model and then alter it to the extent that your data contradict your original guess.
2. Get more data. Not by getting information on more years (I assume you can’t do that) but by breaking up the data you do have, for example by geography, or class of business, or size of business, or some other factor. Or could each business be a data point? What I’m getting at is, it seems that you must have a lot more than 27 pieces of information you could analyze.
3. With a small n and many predictors, you often can’t come to a good story about what is happening but you can still rule out a lot of potential stories. For example, suppose you have 20 candidate predictors. You can’t just throw these into a regression. But you can correlate each of the predictors with the outcome, one at a time, and discover either a very close predictive relation with one or more of the separate predictors, or no such relation. Either way, you’ve learned something. It ain’t nothing to know that none of these 20 inputs determines the output all by itself.
4. You can combine predictors. For example, if you have 5 similar predictors, each measuring some aspect of a common input, you can average them (after rescaling, if necessary) and then use that average as a single predictor. Bill James did that sort of thing in his baseball analyses. Instead of throwing all his variables into a regression, he’d use theory (of a sort) and some data analysis to compute composite scores such as “runs created” and then use these composites in his further analyses.
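Suggestions 3 and 4 can be sketched in a few lines. The example below is only an illustration on simulated data: the 20 predictors, the outcome, and the helper names `screen` and `composite` are all hypothetical, and the z-score rescaling is one reasonable choice among several.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, n_predictors = 27, 20

# Simulated stand-in data: column 0 drives the outcome, the rest are noise.
X = rng.normal(size=(n_years, n_predictors))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n_years)

def screen(X, y):
    """Correlation of each column of X with y, one predictor at a time."""
    return np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

def composite(X_group):
    """Average a group of similar predictors after rescaling each to z-scores."""
    z = (X_group - X_group.mean(axis=0)) / X_group.std(axis=0, ddof=1)
    return z.mean(axis=1)

# Suggestion 3: look at the predictors one at a time, strongest first.
r = screen(X, y)
for j in np.argsort(-np.abs(r))[:3]:
    print(f"predictor {j:2d}: r = {r[j]:+.2f}")

# Suggestion 4: five noisy measurements of one hypothetical common input,
# combined into a single composite score.
common = rng.normal(size=n_years)
group = common[:, None] + rng.normal(size=(n_years, 5))
score = composite(group)
print(f"one measurement vs. common input: r = {np.corrcoef(group[:, 0], common)[0, 1]:.2f}")
print(f"composite vs. common input:       r = {np.corrcoef(score, common)[0, 1]:.2f}")
```

With 27 observations and 20 separate correlations, a few pure-noise predictors will show moderate correlations by chance, so the screen is better at ruling stories out than at confirming them. The composite then enters a regression as one predictor instead of five, which matters a great deal when n is this small.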
