The first real data set I ever analyzed was from my senior honors thesis as an undergraduate psychology major. I had taken both intro stats and an ANOVA class, and I applied all my new skills with gusto, analyzing every which way.
It wasn’t too many years into graduate school before I realized that these data analyses were a bit haphazard, and not at all well thought out. Twenty years of data analysis experience later, I realize that’s just a symptom of being an inexperienced data analyst.
But even experienced data analysts can get off track, especially with large data sets with many variables. It’s just so easy to try different versions of models or get distracted by interesting, but irrelevant, relationships among variables.
The lesson? Make a plan.
Make a Plan
According to Frank Scarpaci, owner of Project Designworks, “Every dollar spent on planning and preparation saves $10 on project work or $100 on fixing problems after the project is done.”
I’m pretty sure that ratio holds not just for money, but for time and frustration too. I mean, you’d rather spend an hour now planning the analysis than two weeks redoing it after reviewers rip it to shreds, right?
The best time to plan the analysis is before collecting data. This prevents those (all too common) situations where you realize you needed another variable or you should have measured something differently. Grant applications force you to do this, but every study would benefit.
How do you plan it?
I find a great outline for an analysis plan comes from an article by Daryl Bem about writing journal articles. The entire article is excellent (and I highly recommend it), but most helpful for planning is the section, “Presenting the Findings”. This section outlines 7 steps for reporting each finding. For planning purposes, I condense these into three:
- State the conceptual hypothesis you are testing
- Restate this hypothesis in the terms of the variables that measure the concept
- List the statistical test or method that will answer this question
Simply repeat these three steps for all hypotheses the study is set up to answer. Start with the most general and important, and work down from there.
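One way to keep such a plan honest is to write it down in a structured form. Here is a minimal sketch in Python, assuming nothing beyond the standard library. The class name, field names, and example hypotheses are all made up for illustration; they are not from Bem's article.

```python
from dataclasses import dataclass


@dataclass
class PlannedAnalysis:
    """One entry in the analysis plan (hypothetical structure)."""
    conceptual_hypothesis: str   # step 1: the idea being tested
    operational_hypothesis: str  # step 2: restated in terms of measured variables
    method: str                  # step 3: the test or model that will answer it
    priority: int = 1            # 1 = most general and important


def format_plan(plan):
    """Return the plan as numbered text, most important hypothesis first."""
    lines = []
    ordered = sorted(plan, key=lambda entry: entry.priority)
    for number, entry in enumerate(ordered, start=1):
        lines.append(f"{number}. {entry.conceptual_hypothesis}")
        lines.append(f"   Variables: {entry.operational_hypothesis}")
        lines.append(f"   Method: {entry.method}")
    return "\n".join(lines)


# Made-up example entries, listed out of order on purpose.
plan = [
    PlannedAnalysis(
        conceptual_hypothesis="The effect of sleep loss is larger for older adults",
        operational_hypothesis="Group difference in recall score varies by age group",
        method="Two-way ANOVA with a group-by-age interaction",
        priority=2,
    ),
    PlannedAnalysis(
        conceptual_hypothesis="Sleep loss impairs memory",
        operational_hypothesis="Mean recall score is lower in the sleep-deprived group",
        method="Independent-samples t-test",
        priority=1,
    ),
]

print(format_plan(plan))
```

Sorting by priority mirrors the advice to start with the most general and important hypothesis and work down from there.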
The Research Question is Central
You may have noticed that at the center is the conceptual hypothesis, or in looser terms, the research question. Everything you run should ultimately move you toward answering the research questions.
Write down your research questions and tape them to the wall near your computer.
There may be additional analyses that support the main one, and you may or may not be able to plan for them. But they should still serve the overall purpose of answering the research question.
For example, always plan on running univariate and bivariate descriptives and graphs to get a sense of your variables and their most basic relationships before you do much else.
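As a rough illustration of what that first pass might look like, here is a small sketch using only Python's standard library. The variable names and data values are invented for the example, and in practice you would likely use your usual statistics package and add graphs.

```python
import statistics
from collections import Counter


def describe_numeric(values):
    """Univariate summary for a numeric variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


def pearson_r(x, y):
    """Bivariate summary: Pearson correlation between two numeric variables."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / (sum_xx * sum_yy) ** 0.5


# Made-up data for illustration only.
hours_sleep = [4, 5, 6, 7, 8, 9]
recall = [10, 12, 11, 15, 16, 18]
group = ["deprived", "deprived", "deprived", "rested", "rested", "rested"]

print(describe_numeric(recall))        # univariate: numeric variable
print(Counter(group))                  # univariate: categorical variable
print(pearson_r(hours_sleep, recall))  # bivariate: two numeric variables
```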
Likewise, if you know you will need to run a factor analysis to create an index variable or deal with inevitable missing data, plan for those too.
Even the best plans, though, are guidelines. Surprises do come up (both good and bad), and you will probably have to adjust the plan as you go along. But don’t let that stop you from planning.
When you don’t know which tests answer the research question
“But wait a minute. I know the research questions. I just don’t know which statistics to use to answer them. What about those?” (I can hear you right now.)
The third step in planning is to choose the statistical test(s) to answer that research question. It’s impossible to list all the things to consider in choosing a statistical test, and there often isn’t just one option.
But here are some general guidelines. The statistical test must:
- Answer the research question. If your research question requires controlling for variables, your test needs to have that ability. If the research question is about group differences, the test needs to be able to compare groups. This is why being specific is so important.
- Take into account the design of the study. Unless they were designed to accommodate other situations, most statistical tests assume simple random samples of independent measurements. If your sample is stratified or clustered, if measurements are repeated over time or space, or if some other design feature makes the measurements more complex than that, the test needs to accommodate it.
- Take into account the level of measurement of the independent and dependent variables. The exact same research question from the same design will use different statistical methods if the dependent variable is categorical than if it’s numerical.
- Deal with any issues in the data, like influential outliers, multicollinearity, truncation, and missing data. Unlike the three considerations above, you can’t always anticipate data issues, and you can’t always deal with them in the main analysis. You may have to use preliminary tests to deal with them first.
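To make the design and level-of-measurement guidelines concrete, here is a deliberately simplified sketch of how they narrow the options. The function name and the lookup table are my own illustration, not a complete decision rule. It assumes one independent and one dependent variable, and it only points to a starting family of methods.

```python
def suggest_methods(iv_level, dv_level, repeated_or_clustered=False):
    """Suggest a starting family of methods (illustrative simplification).

    iv_level, dv_level: "categorical" or "numerical".
    repeated_or_clustered: True if measurements are not independent.
    """
    table = {
        ("categorical", "numerical"): "t-test or ANOVA",
        ("numerical", "numerical"): "correlation or linear regression",
        ("categorical", "categorical"): "chi-square test of independence",
        ("numerical", "categorical"): "logistic regression",
    }
    key = (iv_level, dv_level)
    if key not in table:
        raise ValueError(f"unknown measurement levels: {key}")
    suggestion = table[key]
    if repeated_or_clustered:
        # The design guideline: independence is violated, so flag it.
        suggestion += " (choose a version that handles non-independent observations)"
    return suggestion


# Same research question and design, different dependent variable.
print(suggest_methods("categorical", "numerical"))
print(suggest_methods("categorical", "categorical"))
print(suggest_methods("categorical", "numerical", repeated_or_clustered=True))
```

The first two calls show the point about level of measurement: the independent variable is the same, but a categorical outcome leads to a different method than a numerical one.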
Please comment on the article here: The Analysis Factor