Update on that study of p-hacking

Ron Berman writes:

I noticed you posted an anonymous email about our working paper on p-hacking and false discovery, but was a bit surprised that it references an early version of the paper.
We addressed the issues mentioned in the post more than two months ago in a version that has been available online since December 2018 (same link as above).

I wanted to send you a few thoughts on the post, in the hope that you will find them interesting enough to post on your blog as our reply and that they will inform your readers a little more about the paper.

These thoughts are presented in the separate section below.

The more recent analysis applies an MSE-optimal bandwidth selection procedure. Hence, we use a single bandwidth when assessing the presence of a discontinuity at a specific confidence level in an experiment.

Less importantly, the more recent analysis uses a triangular kernel and linear regression (though we also report a traditional logistic regression analysis result for transparency and robustness).
The results have not changed much, and some have strengthened.

With regard to the RDD charts, the overall visual fit indeed might not be great. But we think the fit within the MSE-optimal window width is actually good.

The section below provides more details, and I hope you will find it worth posting on your blog.

We would also, of course, welcome any feedback you may have on the methods we are using, including the second part of the paper, where we attempt to quantify the consequences of p-hacking on false discovery and foregone learning.

I am learning from all the feedback we receive and am constantly working to improve the paper.

More details about the blog post:

The comments by the anonymous letter writer are about an old version of the paper, and we addressed them a few months ago.

Three main concerns were expressed: the choice of six possible discontinuities, the RDD window widths, and the RDD plot showing weak visual evidence of a discontinuity in stopping behavior based on the confidence level the experiment reaches.

1. Six hypotheses

We test six different hypotheses, each positing optional stopping based on the p-value of the experiment at one of the three commonly used levels of significance/confidence in business and social science (90, 95, and 99%), for both positive and negative effects (3 × 2 = 6).
We view these as six distinct a priori hypotheses, one for each specific form of stopping behavior, not six tests of the same hypothesis.
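To make the 3 × 2 structure concrete, here is a minimal sketch in Python (not the paper's code; the variable names are illustrative, and the levels are simply the nominal confidence levels named above):

```python
# Enumerate the six a priori stopping hypotheses, one per
# (confidence level, effect direction) pair.  Illustrative only.
confidence_levels = [0.90, 0.95, 0.99]      # levels commonly used in practice
effect_directions = ["positive", "negative"]

hypotheses = [(level, direction)
              for level in confidence_levels
              for direction in effect_directions]

for level, direction in hypotheses:
    print(f"H: experimenters stop once the test shows a {direction} "
          f"effect at {level:.0%} confidence")

assert len(hypotheses) == 6   # 3 levels x 2 directions
```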

2. RDD window width

The December 2018 version of the paper details an RDD analysis using an MSE-optimal bandwidth linear regression with a triangular kernel.
The results (and implications) haven’t changed dramatically using the more sophisticated approach, which relies on a single window for each RDD.
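As a rough illustration of what such an estimator does (a sketch, not the paper's actual code), the following Python function computes a local-linear RDD estimate with a triangular kernel; the bandwidth is passed in as a number and would in practice come from an MSE-optimal selection procedure such as the one implemented in the rdrobust package:

```python
import numpy as np
import statsmodels.api as sm

def rdd_discontinuity(conf, stopped, cutoff, bandwidth):
    """Local-linear RDD estimate of the jump in the stopping rate at `cutoff`.

    conf      : running variable (confidence level reached by the experiment)
    stopped   : 0/1 indicator that the experiment was stopped
    cutoff    : confidence level where a discontinuity is hypothesized
    bandwidth : window half-width, assumed to come from an MSE-optimal
                selection procedure (e.g., the rdrobust package)
    """
    x = np.asarray(conf, dtype=float) - cutoff
    y = np.asarray(stopped, dtype=float)

    # Keep observations inside the window and weight them with a
    # triangular kernel that decays linearly away from the cutoff.
    inside = np.abs(x) <= bandwidth
    w = 1.0 - np.abs(x[inside]) / bandwidth

    # Local-linear specification: separate slopes on each side of the cutoff,
    # plus a jump indicator whose coefficient is the estimated discontinuity.
    d = (x[inside] >= 0).astype(float)
    X = sm.add_constant(np.column_stack([d, x[inside], d * x[inside]]))
    fit = sm.WLS(y[inside], X, weights=w).fit()
    return fit.params[1], fit.bse[1]   # jump estimate and its standard error

# Example call with made-up data:
# rng = np.random.default_rng(0)
# conf = rng.uniform(0.80, 0.99, size=5000)
# stopped = (rng.uniform(size=5000) < 0.2 + 0.1 * (conf >= 0.895)).astype(int)
# print(rdd_discontinuity(conf, stopped, cutoff=0.895, bandwidth=0.03))
```

The triangular kernel gives observations closer to the cutoff more weight, which is the standard pairing with MSE-optimal bandwidths in local-linear RDD estimation.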

We report all the tables in full in the paper. This is what the results look like (Table 5 of the paper includes the details of the bandwidth sizes, number of observations, etc.):

The linear and the bias-corrected linear models use the “sophisticated” MSE-optimal method. We also report a logistic regression analysis with the same MSE-optimal window width for transparency and to show robustness.

All the effects are reported as marginal effects to allow easy comparison.
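For the logistic specification, a marginal effect can be obtained along these lines (again a sketch with simulated data and assumed variable names, not the paper's code):

```python
import numpy as np
import statsmodels.api as sm

# Sketch: a logit version of the same local RDD specification within a
# narrow window around the cutoff, reported as an average marginal effect
# so it sits on the same probability scale as the linear estimates.
rng = np.random.default_rng(0)
conf = rng.uniform(0.86, 0.93, size=2000)             # running variable
stopped = (rng.uniform(size=2000) <
           0.2 + 0.1 * (conf >= 0.895)).astype(int)    # 0/1 stopping outcome

x = conf - 0.895                     # center at the hypothesized cutoff
d = (x >= 0).astype(float)           # above-cutoff indicator
X = sm.add_constant(np.column_stack([d, x, d * x]))

logit_fit = sm.Logit(stopped, X).fit(disp=0)
# Average marginal effects translate the logit coefficients to the
# probability scale, making the columns of the table directly comparable.
print(logit_fit.get_margeff(at="overall").summary())
```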

Not much has changed in the results, and the main conclusion about p-hacking remains the same: a sizable fraction of experiments exhibit credible evidence of stopping when the A/B test reaches 90% confidence for a positive effect, but not at the other levels of significance typically used and not for negative effects.

3. RDD plots

With respect to the RDD charts, the fit might indeed not look great visually. But what matters for causal identification in such a quasi-experiment, in our opinion, is the evidence of a discontinuity at the point of interest rather than the overall fit to the data.

Here is the chart with the MSE-optimal bandwidth around .895 confidence (presented as 90% to the experimenters) from the paper. Apart from the outlier at .89 confidence, we think the lines track the raw fractions rather well.
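For readers who want to see the kind of plot being described, here is a minimal sketch of an RDD plot of this type (binned raw stopping fractions with separate linear fits on each side of the cutoff), using simulated data rather than the paper's data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated illustration only: stopping rates with a jump at the cutoff.
rng = np.random.default_rng(1)
conf = rng.uniform(0.86, 0.93, size=20000)
stopped = (rng.uniform(size=20000) < 0.2 + 0.1 * (conf >= 0.895)).astype(int)
cutoff = 0.895

# Raw fraction of stopped experiments in narrow confidence bins.
bins = np.arange(0.86, 0.931, 0.005)
centers = (bins[:-1] + bins[1:]) / 2
frac = [stopped[(conf >= lo) & (conf < hi)].mean()
        for lo, hi in zip(bins[:-1], bins[1:])]
plt.scatter(centers, frac, label="binned stopping fraction")

# Separate linear fits below and above the cutoff.
for side in (conf < cutoff, conf >= cutoff):
    slope, intercept = np.polyfit(conf[side], stopped[side], 1)
    xs = np.linspace(conf[side].min(), conf[side].max(), 50)
    plt.plot(xs, intercept + slope * xs)

plt.axvline(cutoff, linestyle="--")
plt.xlabel("confidence level reached")
plt.ylabel("fraction of experiments stopped")
plt.legend()
plt.show()
```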

I wrote that earlier post in September, hence it was based on an earlier version of the article. It’s good to hear about the update.