(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Hey kids! Time to think about writing that statistics Ph.D. thesis.

It would be great to write something on a cool applied project, but: (a) you might not be connected to a cool applied project, and you typically can’t do these on your own, since you need collaborators who know what they’re doing and who care about getting the right answer; and (b) you’re in your doctoral program learning all this theory, so now’s the time to *really* learn that theory, by using it!

So here we are at Statistical Modeling, Causal Inference, and Social Science to help you out. Yes, that’s right, we have a thesis topic for you!

The basic idea is here, a post that was written several months ago but just happened to appear this morning. Here’s what’s going on: In various areas of the human sciences, it’s been popular to hypothesize, or apparently experimentally prove, that all sorts of seemingly trivial interventions can have large effects. You’ve all heard of the notorious claim, unsupported by data, “That a person can, by assuming two simple 1-min poses, embody power and instantly become more powerful has real-world, actionable implications,” but that’s just one of many, many examples. We’ve also been told that whether a hurricane has a boy or a girl name has huge effects on evacuation behavior; we’ve been told that male college students with fat or thin arms have different attitudes toward economic redistribution, with that difference depending crucially on the socioeconomic status of their parents; we’ve been told that women’s voting behavior varies by a huge amount based on the time of the month, with that difference depending crucially on their relationship status; we’ve been told that political and social attitudes and behavior can be shifted in consistent ways by shark attacks and college football games and subliminal smiley faces and chance encounters with strangers on the street and, ummm, “exposure to an incidental black and white visual contrast.” You get the idea.

But that’s just silly science, it’s not a Ph.D. thesis topic in statistical theory—yet.

Here’s where the theory comes in. I’ve written about the piranha problem, that these large and consistent effects can’t all, or even mostly, be happening. The problem is that they would interfere with each other: On one hand, you can’t have dozens of large and consistent *main effects* or else it would be possible to push people’s opinions and behavior to ridiculously implausible lengths just by applying several stimuli in sequence (for example, football game plus shark attack plus fat arms plus an encounter on the street). On the other hand, once you allow these effects to have *interactions*, it becomes less possible for them to be detected in any generalizable way in an experiment. (For example, the names of the hurricanes could be correlated with recent football games, shark attacks, etc.)

We had some discussion of this idea in the comment thread (that’s where I got off the quip, “Yes, in the linked article, Dijksterhuis writes, ‘The idea that merely being exposed to something that may then exert some kind of influence is not nearly as mystifying now as it was twenty years ago.’ But the thing he doesn’t seem to realize is that, as Euclid might put it, there are an infinite number of primes…”), and what I’m thinking would really make the point clear would be to demonstrate it theoretically, using some sort of probability model (or, more generally, mathematical model) of effects and interactions.

A proof of the piranha principle, as it were. Some sort of asymptotic result as the number of potential effects increases. I really like this idea: it makes sense, it seems amenable to theoretical study, it could be modeled in various different ways, it’s important for science and engineering (you’ll have the same issue when considering A/B tests for hundreds of potential interventions), and it’s not trivial, mathematically or statistically.
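To give a flavor of what such a result might look like, here is a minimal sketch of the easiest possible case. Everything in it is an assumption made for illustration and is not taken from the linked post: the outcome $y$ is standardized to unit variance, the $K$ stimuli $x_k$ are mutually independent with unit variance, the effects $\beta_k$ are purely additive, and there are no interactions.

```latex
% Toy model (an assumption for illustration): additive, independent effects.
y = \sum_{k=1}^{K} \beta_k x_k + \varepsilon,
\qquad \operatorname{Var}(x_k) = 1,
\qquad x_1, \dots, x_K, \varepsilon \ \text{mutually independent}

% Independence makes the variances add, and Var(y) is fixed at 1.
\operatorname{Var}(y) = \sum_{k=1}^{K} \beta_k^2 + \operatorname{Var}(\varepsilon) = 1
\;\Longrightarrow\; \sum_{k=1}^{K} \beta_k^2 \le 1

% Counting bound: m effects of size at least b contribute at least m b^2.
\#\{k : |\beta_k| \ge b\} \;\le\; \frac{1}{b^2}
\qquad \text{for any threshold } b > 0
```

Under these assumptions, no more than 11 effects can have a standardized size of 0.3 or larger (since $1/0.3^2 \approx 11.1$), and the average squared effect is at most $1/K$, so a typical effect has to shrink like $1/\sqrt{K}$. The hard and interesting part of the problem is what happens once the stimuli are correlated and the effects interact, which this toy version deliberately ignores.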

As always, I recommend starting with fake-data simulation to get an idea of what’s going on, then moving on to some theory.
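Here is one way such a fake-data simulation could start. It is a rough sketch under assumptions of my choosing, not a worked-out version of the project: each of `n_effects` stimuli is switched on independently with probability 0.5, each one shifts the outcome by the same `effect_size`, and there is independent normal noise. The function names and all default values are hypothetical. The point is only to watch how fast the spread of the outcome grows when many "large" main effects are stacked up.

```python
import random
import statistics


def simulate_outcome_sd(n_effects, effect_size, n_people=2000, noise_sd=1.0, seed=0):
    """Simulated SD of an outcome driven by many independent binary stimuli.

    Assumed toy model: each stimulus is on with probability 0.5 and, when on,
    adds effect_size to the outcome; normal noise with SD noise_sd is added.
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_people):
        # Number of stimuli that happen to be switched on for this person.
        n_on = sum(rng.random() < 0.5 for _ in range(n_effects))
        outcomes.append(effect_size * n_on + rng.gauss(0.0, noise_sd))
    return statistics.pstdev(outcomes)


def theoretical_sd(n_effects, effect_size, noise_sd=1.0):
    """SD implied by the same toy model.

    Each independent Bernoulli(0.5) stimulus contributes effect_size**2 / 4
    to the variance, and the noise contributes noise_sd**2.
    """
    return (noise_sd ** 2 + n_effects * effect_size ** 2 / 4) ** 0.5


if __name__ == "__main__":
    # Compare simulation and formula as the number of stimuli grows.
    for k in (1, 10, 100):
        simulated = simulate_outcome_sd(k, effect_size=1.0)
        implied = theoretical_sd(k, effect_size=1.0)
        print(f"K={k:4d}  simulated SD={simulated:.2f}  implied SD={implied:.2f}")
```

In this toy setup the outcome's standard deviation grows roughly like the square root of the number of stimuli, so if the real-world outcome has a fixed, modest spread, the individual effects cannot all be large. Natural next steps would be to let the stimuli be correlated, add interactions, or draw the effect sizes from a distribution instead of fixing them.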

**P.S.** You might think: Hey, I’m reading this, but hundreds of other statistics Ph.D. students are reading this at the same time. What if *all* of them work on this one project? Then do I need to worry about getting “scooped”? The answer is, No, you don’t need to worry! First, hundreds of Ph.D. students might read this post, but only a few will pick this topic. Second, there’s a lot to do here! My first pass above is based on the normal distribution, but you could consider other distributions; you could look not just at the distribution of underlying parameter values but at the distribution of estimates; you could embed the whole problem in a time series structure; you could look at varying treatment effects; there’s the whole issue of how to model interactions; and there’s an entirely different approach based on hard bounds. There are all sorts of directions to go. And that’s not meant to intimidate you. No need to go in all these directions at once; rather, *any* of these directions will give you a great thesis project. And it will be different from everyone else’s on the topic. So get going, already! This stuff’s important, and we can use your analytical skills.

The post Important statistical theory research project! Perfect for the stat grad students (or ambitious undergrads) out there. appeared first on Statistical Modeling, Causal Inference, and Social Science.
