Someone who wishes to remain anonymous writes:
This paper [“p-Hacking and False Discovery in A/B Testing,” by Ron Berman, Leonid Pekelis, Aisling Scott, and Christophe Van den Bulte] ostensibly provides evidence of “p-hacking” in online experimentation (A/B testing) by looking at the decision to stop experiments right around the thresholds at which the platform reports its confidence that A beats B (a confidence level that is just a transformation of the p-value).
It is a regression discontinuity design:
Indeed, the above regression discontinuity fits look pretty bad, as can be seen by imagining the scatterplots without those superimposed curves.
My correspondent continues:
The whole thing has forking paths and multiple comparisons all over it: they consider many different thresholds, then use both linear and quadratic fits with many different window sizes (not selected via standard methods), and then later parts of the paper focus only on the specifications that are the most significant (p less than 0.05, but p greater than 0.1).
Huh? Maybe he means “greater than 0.05, less than 0.1”? Whatever.
Anyway, he continues:
Example table (this is the one that looks best for them, others relegated to appendix):
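The specification-search worry is easy to see in simulation. Here's a minimal sketch — my own toy setup, not the paper's data or method — in which the data contain no discontinuity at all, yet trying several candidate windows around the cutoff and reporting whichever one comes out significant inflates the rejection rate well past the nominal level:

```python
import math
import random

def z_reject(left, right):
    # two-sample z-test for a difference in means (known unit variance)
    nl, nr = len(left), len(right)
    if nl < 2 or nr < 2:
        return False
    diff = sum(right) / nr - sum(left) / nl
    return abs(diff) / math.sqrt(1 / nl + 1 / nr) > 1.96

rng = random.Random(0)
windows = [0.1, 0.2, 0.3, 0.5]   # candidate bandwidths around the cutoff at 0
sims, n = 3000, 400
any_hits = one_hits = 0
for _ in range(sims):
    # null data: pure noise, no jump at the cutoff anywhere
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [rng.gauss(0, 1) for _ in range(n)]
    results = []
    for w in windows:
        left = [y for x, y in zip(xs, ys) if -w <= x < 0]
        right = [y for x, y in zip(xs, ys) if 0 <= x <= w]
        results.append(z_reject(left, right))
    any_hits += any(results)   # report the "best" of the four specifications
    one_hits += results[1]     # commit to a single bandwidth in advance

print(f"rejection rate, pre-chosen window:   {one_hits / sims:.3f}")  # ~0.05
print(f"rejection rate, best of 4 windows:   {any_hits / sims:.3f}")  # inflated
```

This only varies the window; the paper's critics could add thresholds and linear-vs-quadratic fits as further forks, which would push the best-of rate higher still.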
So maybe an interesting tour of:
– How much optional stopping is there in industry? (Of course there is some.)
– Self-deception, ignorance, and incentive problems for social scientists
– Reasonable methods for regression discontinuity designs.
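On the first point, a quick simulation — again a toy setup of mine, not anything from the paper — shows what optional stopping does to p-value-based decisions: under a true null, peeking at a running z-test and stopping at p < .05 inflates the false positive rate well past the nominal 5%:

```python
import math
import random

def z_test_reject(total, n, crit=1.96):
    # one-sample z-test of mean = 0, known sd = 1
    return abs(total / math.sqrt(n)) > crit

def run_experiment(max_n, peek_every, rng):
    # data generated under the null: the true effect is exactly zero
    total = 0.0
    for i in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if i % peek_every == 0 and z_test_reject(total, i):
            return True  # "significant" -- stop and declare a winner
    return False

rng = random.Random(42)
sims, max_n = 2000, 200

# peek after every 10 observations, stopping as soon as p < .05
peeking_rate = sum(run_experiment(max_n, 10, rng) for _ in range(sims)) / sims

# analyze only once, at the planned sample size
fixed_rate = sum(run_experiment(max_n, max_n, rng) for _ in range(sims)) / sims

print(f"false positive rate, fixed n:      {fixed_rate:.3f}")    # ~0.05
print(f"false positive rate, with peeking: {peeking_rate:.3f}")  # well above 0.05
```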
I’ve not read the paper in detail, so I’ll just repeat that I prefer to avoid the term “p-hacking,” which, to me, implies a purposeful gaming of the system. I prefer the expression “garden of forking paths” which allows for data-dependence in analysis, even without the researchers realizing it.
Also . . . just cos the analysis has statistical flaws, that doesn’t mean the central claims of the paper in question are false. The claims could well be true, even if the authors don’t quite have good enough data to prove them.
And one other point: There’s nothing at all wrong with data-dependent stopping rules. The problem is all in the use of p-values for making decisions. Use the data-dependent stopping rules, use Bayesian decision theory, and it all works out.
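To illustrate that last claim with a toy model — a normal prior and normal data, my own assumptions, not anything from the paper: if you stop whenever the posterior probability that the effect is positive exceeds 95% and then declare a winner, the declarations stay calibrated no matter how aggressively you peek, because the posterior doesn’t care about the stopping rule.

```python
import math
import random

def post_prob_positive(total, n):
    # effect theta ~ N(0, 1) prior, unit-variance observations:
    # posterior is N(total / (n + 1), 1 / (n + 1)); return P(theta > 0 | data)
    mean = total / (n + 1)
    sd = math.sqrt(1.0 / (n + 1))
    return 0.5 * (1.0 + math.erf(mean / (sd * math.sqrt(2.0))))

rng = random.Random(7)
declared = correct = 0
for _ in range(4000):
    theta = rng.gauss(0.0, 1.0)   # true effect drawn from the prior
    total = 0.0
    for n in range(1, 201):       # peek after every single observation
        total += rng.gauss(theta, 1.0)
        if post_prob_positive(total, n) > 0.95:
            declared += 1         # one-sided rule, for simplicity
            correct += theta > 0  # was the declaration actually right?
            break

print(f"declared a positive effect in {declared} of 4000 runs")
print(f"fraction correct among declarations: {correct / declared:.3f}")  # ~0.95 or higher
```

The catch, of course, is that the calibration is averaged over the prior: the guarantee is only as good as the prior is.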