Hunting for “nominally” significant differences, trying different subgroups and multiple endpoints, can result in a much higher probability of erroneously inferring evidence of a risk or benefit than the nominal p-value, even in randomized controlled trials. This was an issue that arose in looking at RCTs in development economics (an area introduced to me by Nancy Cartwright), as at our symposium at the Philosophy of Science Association last month[i][ii]. Reporting the results of hunting and dredging in just the same way as if the relevant claims were predesignated can lead to misleading reports of actual significance levels.[iii]
Still, even if reporting spurious statistical results is considered “bad statistics,” is it criminal behavior? I noticed this issue in Nathan Schachtman’s blog over the past couple of days. The case concerns a biotech company, InterMune, and its previous CEO, Dr. Harkonen. Here’s an excerpt from Schachtman’s discussion (part 1).
In August 2002, Dr. Harkonen approved a press release, which carried a headline, “phase III data demonstrating survival benefit of Actimmune in IPF.” A subtitle announced the 70% relative reduction in patients with mild to moderate disease. ……
….The prosecution asserted that Dr. Harkonen engaged in data dredging, grasping for the right non-prespecified end point that had a low p-value attached. Such data dredging implicates the problem of multiple comparisons or tests, with the result of increasing the risk of a false-positive finding, notwithstanding the p-value below 0.05.
Supported by the testimony of Professor Thomas Fleming, who chaired the Data Safety Monitoring Board for the clinical trial in question, the government claimed that the trial results were “negative” because the p-values for all the pre-specified endpoints exceeded 0.05. Shortly after the press release, Fleming sent InterMune a letter that strongly dissented from the language of the press release, which he characterized as misleading. Because the primary and secondary end points were not statistically significant, and because the reported mortality benefit was found in a non-prespecified subgroup, the interpretation of the trial data required “greater caution,” and the press release was a “serious misrepresentation of results obtained from exploratory data subgroup analyses.”
The district court sentenced Harkonen to six months of home confinement, three years of probation, 200 hours of community service, and a fine of $20,000. Dr. Harkonen appealed on grounds that the federal fraud statutes do not permit the government to prosecute persons for expressing scientific opinions about which reasonable minds can differ. Unless no reasonable expert could find the defendant’s statement to be true, the trial court should dismiss the prosecution. Statements that have support even from a minority of the scientific community should not be the basis for a fraud charge. In Dr. Harkonen’s case, the government did not allege any misstatement of an objectively verifiable fact, but alleged falsity in his characterization of the data’s “demonstration” of an efficacy effect. The government cross-appealed to complain about the leniency of the sentence…..
And what about the claim that “Unless no reasonable expert could find the defendant’s statement to be true, the trial court should dismiss the prosecution.”?
[i] “Development: Knowing What Works, Evidence, Evaluation, and Experiment” (J. Burke, N. Cartwright, H. Seckinelgin, D. Fennel).
[ii] This has long been argued by Angus Deaton: http://www.nber.org/papers/w14690
[iii] I’m not sure if it’s required, but the FDA at least recommends that analytical plans be submitted prior to trial to avoid the effects of “trying and trying again”.
Filed under: PhilStatLaw, significance tests, spurious p values, Statistics
Please comment on the article here: Error Statistics Philosophy » Statistics