Simulation-based statistical testing in journalism

Jonathan Stray writes:

In my recent Algorithms in Journalism course we looked at a post which makes a cute little significance-type argument that five Trump campaign payments were actually the \$130,000 Daniels payoff. They summed to within a dollar of \$130,000, so the simulation recreates sets of payments using bootstrapping and asks how often there’s a subset that gets that close to \$130,000. It concludes “very rarely” and therefore that this set of payments was a coverup.

(This is part of my broader collection of simulation-based significance testing in journalism.)

I recreated this payments simulation in a notebook to explore this. The original simulation checks sets of ten payments, which the authors justify because “we’re trying to estimate the odds of the original discovery, which was found in a series of eight or so payments.” You get about p=0.001 that any set of ten payments gets within \$1 of \$130,000. But the authors also calculated p=0.1 or so if we choose from 15, and my notebook shows this that goes up rapidly to p=0.8 if you choose 20 payments.

So the inference you make depends crucially on the universe of events you use. I think of this as the denominator in the frequentist calculation. It seems like a free parameter robustness problem, and for me it casts serious doubt on the entire exercise.

My question is: Is there a principled way to set the denominator in a test like this? I don’t really see one.

I’d be much more comfortable with fully Bayesian attempt, modeling the generation process for the entire observed payment stream with and without a Daniels payoff. Then the result would be expressed as a Bayes factor which I would find a lot easier to interpret — and this would also use all available data and require making a bunch of domain assumptions explicit, which strikes me as a good thing.

But I do still wonder if frequentist logic can answer the denominator question here. It feels like I’m bumping up against a deep issue here, but I just can’t quite frame it right.

Most fundamentally, I worry that that there is no domain knowledge in this significance test. How does this data relate to reality? What are the FEC rules and typical campaign practice for what is reported and when? When politicians have pulled shady stuff in the past, how did it look in the data? We desperately need domain knowledge here. For an example of what application of domain knowledge to significance testing looks like, see Carl Bialik’s critique of statistical tests for tennis fixing.