S. Senn: Being a statistician means never having to say you are certain (Guest Post)

January 14, 2018

(This article was originally published at Statistics – Error Statistics Philosophy, and syndicated at StatsBlogs.)


Stephen Senn
Head of Competence Center
for Methodology and Statistics (CCMS)
Luxembourg Institute of Health
Twitter @stephensenn

Being a statistician means never having to say you are certain

A recent discussion of randomised controlled trials[1] by Angus Deaton and Nancy Cartwright (D&C) contains much interesting analysis but also, in my opinion, does not escape rehashing some of the invalid criticisms of randomisation with which the literature seems to be littered. The paper has two major sections. The latter, which deals with generalisation of results, or what is sometimes called external validity, I like much more than the former, which deals with internal validity. It is the former I propose to discuss.

The trouble starts, in my opinion, with the discussion of balance. Perfect balance is not, contrary to what is often claimed, a necessary requirement for causal inference, nor is it something that randomisation attempts to provide. Conventional analyses of randomised experiments make an allowance for imbalance, and that allowance is inappropriate if all covariates are balanced. If you analyse a matched-pairs design as if it were completely randomised, you fail that question in Stat 1. (At least if I am marking the exam.) The larger standard error for the completely randomised design is an allowance for the probable imbalance that such a design will have compared to a matched-pairs design.
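The point can be illustrated with a small simulation (a sketch of my own, not part of D&C's paper, with arbitrary assumed variances): when there is strong between-pair variation, the matched-pairs analysis eliminates it, while the completely randomised analysis must allow for it and therefore reports a larger standard error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: 20 matched pairs with a strong pair (block) effect.
n_pairs = 20
pair_effect = rng.normal(0.0, 3.0, n_pairs)   # large between-pair variation
treat = pair_effect + 1.0 + rng.normal(0.0, 1.0, n_pairs)   # assumed effect = 1
control = pair_effect + rng.normal(0.0, 1.0, n_pairs)

# Matched-pairs analysis: within-pair differences remove the pair effect.
d = treat - control
se_paired = d.std(ddof=1) / np.sqrt(n_pairs)

# The (wrong) completely-randomised analysis keeps the between-pair variation
# in its standard error — the allowance for probable imbalance.
se_crd = np.sqrt(treat.var(ddof=1) / n_pairs + control.var(ddof=1) / n_pairs)

print(se_paired, se_crd)   # the completely randomised SE is much larger
```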

This brings me on to another criticism. D&C discuss matching as if it were somehow an alternative to randomisation. But Fisher’s motto for designs can be expressed as, “block what you can and randomise what you can’t”. We regularly run cross-over trials, for example, in which there is blocking by patient, since every patient receives each treatment, and also blocking by period, since each treatment appears equally often in each period but we still randomise patients to sequences.

Part of their discussion recognises this, but elsewhere they simply confuse the issue, for example by discussing randomisation as if it were an alternative to control. Control makes randomisation possible: without control, there is no randomisation. Randomisation makes blinding possible: without randomisation there can be no convincing blinding. Thus, in order of importance, they are control, randomisation and blinding; to set randomisation up as some alternative to control is simply misleading and unhelpful.

Elsewhere they claim, “the RCT strategy is only successful if we are happy with estimates that are arbitrarily far from the truth, just so long as errors cancel out over a series of imaginary experiments”, but this is not what RCTs rely on. The mistake is in becoming fixated with the point estimate. This will, indeed, be in error, but any decent experiment and analysis will deliver an estimate of that error, as, indeed, they concede elsewhere. Being a statistician means never having to say you are certain. To prove a statistician is a liar you have to prove that the probability statement is wrong. That is harder than it may seem.

They correctly identify that when it comes to hidden covariates it is the totality of their effect that matters. In this, their discussion is far superior to the indefinitely many confounders argument that has been incorrectly proposed by others as being some fatal flaw. (See my previous blog Indefinite Irrelevance). However, they then undermine this by adding “but consider the human genome base pairs. Out of all those billions, only one might be important, and if that one is unbalanced, the result of a single trial can be ‘randomly confounded’ and far from the truth”. To which I answer “so what?”. To see the fallacy in this argument, which simultaneously postulates a rare event and conditions on its having happened, even though it is unobserved, consider the following. I maintain that if a fair die is rolled six times, the probability of six sixes in a row will be 1/46,656 and so rather rare. “Nonsense” say D&C, “suppose that the first five rolls have each produced a six, it will then happen one in six times and so is really not rare at all”.
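The dice arithmetic can be checked directly (a trivial sketch, using exact fractions):

```python
from fractions import Fraction

# Unconditional probability of six sixes in six rolls of a fair die.
p_six_sixes = Fraction(1, 6) ** 6
print(p_six_sixes)   # 1/46656 — rare, as claimed

# The fallacy: conditioning on the first five sixes having already happened.
p_given_five = p_six_sixes / Fraction(1, 6) ** 5
print(p_given_five)  # 1/6 — "not rare at all", but only after conditioning
```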

I also consider that their simulation is irrelevant. They ask us to believe that if 100 samples of size 50 are taken from a log-Normal distribution and then, for each sample, the values are permuted 1000 times into 25 in the control group and 25 in the experimental group, the type I error rate for a nominal 5% two-sample t-test will be 13.5%. In view of what is known about the robustness of the t-test under the null hypothesis (there is a long literature going back to Egon Pearson in the 1930s), this is extremely surprising, and as soon as I saw it I disbelieved it. I simulated this myself using 2000 permutations, just for good measure, and found the distribution of type I error rates in the accompanying figure.
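The check can be sketched as follows (my reconstruction of the set-up described above, not the paper's actual code): fix each log-Normal sample, relabel it at random into two groups of 25, and count how often the two-sample t-test rejects at the nominal 5% level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_samples, n, n_perm = 100, 50, 2000
crit = stats.t.ppf(0.975, df=n - 2)   # two-sided 5% critical value, pooled df

rates = np.empty(n_samples)
for i in range(n_samples):
    x = rng.lognormal(size=n)                      # one fixed log-Normal sample
    # n_perm random relabellings into 25 "control" and 25 "treated"
    perms = np.argsort(rng.random((n_perm, n)), axis=1)
    shuffled = x[perms]
    a, b = shuffled[:, :25], shuffled[:, 25:]
    # with equal group sizes the pooled and Welch standard errors coincide
    se = np.sqrt(a.var(axis=1, ddof=1) / 25 + b.var(axis=1, ddof=1) / 25)
    t = (a.mean(axis=1) - b.mean(axis=1)) / se
    rates[i] = np.mean(np.abs(t) > crit)

print(rates.mean())   # averages a little below the nominal 5%, nowhere near 13.5%
```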

Each dot represents the type I error rate over 2000 permutations for one of the 100 samples. It can be seen that for most of the samples the proportion of significant t-tests is less than the nominal 5% and, in fact, the average for the simulation is 4%. It is, of course, somewhat regrettable that some of the values are above 5% and, indeed, five of them have values of nearly 6%, but if this worries you, the cure is at hand: use a permutation t-test rather than a parametric one. (For a history of this approach, see the excellent book by Mielke et al [2].) Don’t confuse the technical details of analysis with the randomisation. Whatever you do for analysis, you will be better off for having randomised whatever you haven’t blocked.
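The cure can be sketched as a minimal permutation test (my own illustrative implementation, not code from the paper). For equal group sizes the pooled t-statistic is a monotone function of the difference in means under permutation, so the mean difference serves as the test statistic:

```python
import numpy as np

def permutation_test(a, b, n_perm=2000, seed=None):
    """Two-sample permutation test on the absolute difference in means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = abs(a.mean() - b.mean())
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = abs(perm[:len(a)].mean() - perm[len(a):].mean())
        if diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)   # "add one" keeps the p-value valid

# Illustration with made-up data: a large shift versus no shift at all.
rng = np.random.default_rng(7)
p_shift = permutation_test(rng.normal(5.0, 1.0, 25), rng.normal(0.0, 1.0, 25), seed=1)
p_null = permutation_test(rng.normal(0.0, 1.0, 25), rng.normal(0.0, 1.0, 25), seed=2)
print(p_shift, p_null)
```

By construction, the proportion of relabellings declared "significant" at level 5% is (close to) 5% whatever the shape of the parent distribution, which is exactly what the parametric t-test only approximates.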

Why does my result differ from theirs? It is hard for me to work out exactly what they have done, but I suspect that it is because they have assumed an impossible situation. They allow that the average treatment effect for the millions of patients that might have been included is zero, but then sample varying effects (that is to say, the difference the treatment makes), rather than merely values (that is to say, the readings for given patients), from this distribution. For any given sample the mean of the effects will not be zero, and so the null hypothesis will, as they point out, not be true for the sample, only for the population. But in analysing clinical trials we don’t consider this population. We have precise control of the allocation algorithm (who gets what if they are in the trial) and virtually none over the presenting process (who gets into the trial), and the null hypothesis that we test is that the effect is zero in the sample, not in some fictional population. It may be that I have misunderstood what they are doing, but I think that this is the origin of the difference.
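The distinction can be made concrete with a two-line sketch (a hypothetical normal model of my choosing, not D&C's specification): draw individual treatment effects whose population mean is zero and note that their average in any actual sample is not zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Individual treatment effects drawn from a population with mean zero —
# the Neyman-style null that D&C's simulation appears to assume.
effects = rng.normal(0.0, 1.0, 50)

# Zero on average in the population, but not in this (or essentially any) sample:
print(effects.mean())
```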

This is an example of the sort of approach that led to Neyman’s famous dispute with Fisher. One can argue about the appropriateness of the Fisherian null hypothesis, “the treatments are the same”, but Neyman’s “the treatments are not the same but on average they are the same” is simply incredible[3]. As D&C’s simulation shows, as soon as you allow this, you will never find a sample for which it is true. If there is no sample for which it is true, what exactly are the remarkable properties of the population for which it is true? D&C refer to magical thinking about RCTs dismissively but this is straight out of some wizard’s wonderland.

My view is that randomisation should not be used as an excuse for ignoring what is known and observed but that it does deal validly with hidden confounders[4]. It does not do this by delivering answers that are guaranteed to be correct; nothing can deliver that. It delivers answers about which valid probability statements can be made and, in an imperfect world, this has to be good enough. Another way I sometimes put it is like this: show me how you will analyse something and I will tell you what allocations are exchangeable. If you refuse to choose one at random I will say, “why? Do you have some magical thinking you’d like to share?”


My research on inference for small populations is carried out in the framework of the IDeAl project http://www.ideal.rwth-aachen.de/ and supported by the European Union’s Seventh Framework Programme for research, technological development and demonstration under Grant Agreement no 602552.


  1. Deaton A, Cartwright N. Understanding and misunderstanding randomized controlled trials. Social Science & Medicine 2017.
  2. Berry KJ, Johnston JE, Mielke PWJ. A Chronicle of Permutation Statistical Methods. Springer International Publishing Limited Switzerland: Cham, 2014.
  3. Senn SJ. Added Values: Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine 2004; 23: 3729-3753.
  4. Senn SJ. Seven myths of randomisation in clinical trials. Statistics in Medicine 2013; 32: 1439-1450.

