Stephen Senn: Rothamsted Statistics meets Lord’s Paradox (Guest Post)


Stephen Senn
Consultant Statistician

The Rothamsted School

I never worked at Rothamsted but during the eight years I was at University College London (1995-2003) I frequently shared a train journey to London from Harpenden (the village in which Rothamsted is situated) with John Nelder, as a result of which we became friends and I acquired an interest in the software package Genstat®.

That in turn got me interested in John Nelder’s approach to analysis of variance, which is a powerful formalisation of ideas present in the work of others associated with Rothamsted. Nelder’s important predecessors in this respect include, at least, RA Fisher (of course) and Frank Yates and others such as David Finney and Frank Anscombe. John died in 2010 and I regard Rosemary Bailey, who has done deep and powerful work on randomisation and the representation of experiments through Hasse diagrams, as being the greatest living proponent of the Rothamsted School. Another key figure is Roger Payne who turned many of John’s ideas into code in Genstat®.

Lord’s Paradox

Lord’s paradox dates from 1967(1) and I wrote a paper(2) about it in Statistics in Medicine some years ago. It was reading The Book of Why(3) by Judea Pearl and Dana MacKenzie and its interesting account of Pearl’s important work in causal inference that revived my interest in it. I recommend The Book of Why but it has one rather irritating feature. It claims that all that statisticians ever did with causation is point out that correlation does not mean causation. I find this rather surprising, since very little of what I have ever done as a statistician has had anything to do with correlation but rather a lot with causation and I certainly don’t think that I am unusual in this respect. I thought it would be an interesting challenge to illustrate what the Rothamsted school, armed with Genstat®, might make of Lord’s paradox.

Interesting discussions of Lord’s paradox will be found not only in The Book of Why but in papers by Van Breukelen(4) and Wainer and Brown(5). However, the key paper is that of Holland and Rubin(6), which has been an important influence on my thinking. I shall consider the paradox in the Wainer and Brown form, in which we are to suppose a situation in which the effect of diet on the weight of students in two halls of residence is considered, one hall providing diet A and the other diet B (I shall label the halls correspondingly A and B). The mean weight at the start of observation in September differs between the two halls (it is higher in B than in A), and it differs by exactly the same amount at the end of observation the following June; in fact (though this is not necessary to the paradox), in neither hall has there been any change over time in mean weight. The means at outcome are the same as the means at baseline. Writing X for baseline and Y for outcome, the four means are as given in the table below, with Y_B – Y_A = X_B – X_A = D.

Hall        Baseline mean        Outcome mean
A           X_A                  Y_A = X_A
B           X_B                  Y_B = X_B

Although the mean weights per hall are the same at outcome as at baseline, some students have lost weight and some have gained weight, so the correlation between baseline and outcome is less than 1. However, the variances are the same at baseline and at outcome and, indeed, from one hall to another, as is the correlation. In further discussion I shall assume that we have the same number of students per hall.

Two statisticians (say John and Jane) now prepare to analyse the data. John uses a so-called change-score (the difference between weight at outcome and weight at baseline for every student). Once he has averaged the scores per hall, he will be calculating the difference

(Y_B – X_B) – (Y_A – X_A) = (Y_B – Y_A) – (X_B – X_A) = D – D = 0.

John thus concludes that there is no effect of diet on weight. Jane, on the other hand, proposes to use analysis of covariance. This is equivalent to ‘correcting’ each student’s weight at outcome by the within-halls regression of the weight at outcome on baseline. Since the variances at baseline and outcome are the same, this is equivalent to correcting the weights by the correlation coefficient, r. We can skip some of the algebra here but it turns out that Jane calculates

(Y_B – Y_A) – r(X_B – X_A) = D – rD = (1 – r)D,

which is not equal to zero unless r = 1.  However, that is not the case here and so a difference is observed. Jane, furthermore, finds that this is extremely significantly different from 0. Hence, Jane concludes that there is a difference between the diets.  Who is right?

Figure 1. Baseline and outcome weights for the students in Hall A (red circles) and Hall B (blue squares). The black line is the line of equality along which John adjusts weights and the red and blue lines are those fitted by Jane in an analysis of covariance and used to adjust weights.

A graphical representation is given in Figure 1, where we can see that if we adjust along the line of equality there is no difference between the halls but if we adjust using the within groups regression there is.
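The two adjustments can be checked numerically. The following Python sketch (the weights and the three-student halls are invented purely for illustration; my own analyses use Genstat®) constructs a small deterministic data set in which each hall’s mean is unchanged from baseline to outcome, then computes John’s change-score contrast and Jane’s covariance-adjusted contrast, using the pooled within-hall slope in the role of r.

```python
# Hypothetical weights (kg): three students per hall, chosen so that each
# hall's mean is unchanged from baseline to outcome and the pooled
# within-hall regression slope of outcome on baseline is 0.8.
hall_a = {"baseline": [60.0, 70.0, 80.0], "outcome": [62.0, 70.0, 78.0]}
hall_b = {"baseline": [65.0, 75.0, 85.0], "outcome": [67.0, 75.0, 83.0]}

def mean(xs):
    return sum(xs) / len(xs)

def pooled_within_slope(halls):
    """Pooled within-hall regression slope of outcome on baseline."""
    sxy = sxx = 0.0
    for h in halls:
        xbar, ybar = mean(h["baseline"]), mean(h["outcome"])
        sxy += sum((x - xbar) * (y - ybar)
                   for x, y in zip(h["baseline"], h["outcome"]))
        sxx += sum((x - xbar) ** 2 for x in h["baseline"])
    return sxy / sxx

d_outcome = mean(hall_b["outcome"]) - mean(hall_a["outcome"])     # Y_B - Y_A = 5
d_baseline = mean(hall_b["baseline"]) - mean(hall_a["baseline"])  # X_B - X_A = 5

john = d_outcome - d_baseline              # change-score contrast: 0
r = pooled_within_slope([hall_a, hall_b])  # 0.8 in this constructed data
jane = d_outcome - r * d_baseline          # (1 - r) * D = 0.2 * 5 = 1.0

print(john, r, jane)  # 0.0 0.8 1.0
```

As the paradox requires, John finds no effect while Jane finds a difference of (1 – r)D.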

The Book of Why points out that the initial weight X is a confounding variable here and not a mediator, stating, ‘Therefore, the second statistician would be unambiguously correct here’ (p. 216). My analysis, however, is slightly different. Basically, I consider that the first statistician is unambiguously wrong but that the second statistician is not unambiguously right. Jane may be right, but this depends on assumptions that need to be made explicit. I shall now explain why.

As Holland and Rubin point out, a key to understanding the paradox is to try to think causally: is there a causal question and, if so, what does it imply? The way I usually try to understand these things is by imagining what I would do if it were a reasonable experiment (which, as we shall see, it is not) and then considering what further adjustments are necessary.

Genstat® versus Lord’s paradox

So let us first of all assume that, in order to understand the effects of diet, each of the two halls had been randomised to a diet. What would a reasonable analysis be? I shall start the investigation by considering outcomes only and see what John Nelder’s theory of the analysis of experiments(7, 8), as encoded in Genstat®, would lead us to conclude. I shall then consider the role of the baseline values. I assume, just for illustration, that we have 100 students per hall and have created a data set with exactly this situation: two halls, one diet per hall, 100 students per hall.

To analyse the structure, Genstat® requires me to declare the block structure first. This is how the experimental material is organised before anything is done to it. Here we have students nested within halls. This is indicated as follows

BLOCKSTRUCTURE Hall/Student

Here “/” is the so-called nesting operator. Next, I have to inform the program of the treatment structure. This is quite simple in this case. There is only one treatment and that is Diet. So I write

TREATMENTSTRUCTURE Diet

Note that this difference between block and treatment structure is fundamental to John Nelder’s approach and, indeed, where not explicit, implicit in the whole approach of the Rothamsted school to designed experiments and thus to Genstat®. Without taking anything away from the achievements of the causal revolution outlined in The Book of Why, it is interesting to note an analogy to the crucial difference between see (block structure) and do (treatment structure) in Pearl’s theory.

Next I need to inform Genstat® what the outcome variable is via an ANOVA statement, for example,

ANOVA Weight

but if I don’t, and just write

ANOVA

all that it will do is produce a so-called null analysis of variance as follows:

Analysis of variance

Source of variation                        d.f.

Hall stratum

Diet                                                   1

Hall.Student stratum                     198

Total                                              199


This immediately shows a problem. The problem is, in a sense, obvious, and I am sure many a reader will consider that I have taken a sledgehammer to crack a nut, but in my defence I can say that, however obvious it is, it appears to have been rather overlooked in discussions of Lord’s paradox. The problem is that, as any statistician can tell you, it is not enough to produce an estimate; you also have to produce an estimate of how uncertain that estimate is. Genstat® tells me here that this is impossible. The block structure defines two strata: the hall stratum and the student-within-hall stratum (here indicated by Hall.Student). There is only one degree of freedom in the first stratum, but unfortunately the treatment appears in this stratum and competes for this single degree of freedom with what has to be used for error variation, namely the difference between halls. Nothing can be said, using the data only, about how precisely we have estimated the effect of diet, and if this is the case the estimate is useless. What we have is the structure of what in a clinical-trials context would be called a cluster randomised trial with only two clusters.

What happens if I re-arrange my experiment to deal with this? Let us accept an implicit practical constraint that we cannot allocate students in the same hall to different diets but let us suppose that we can recruit more halls. Suppose that I could recruit 20 halls, ten for each diet with the same number of students studied in total, so that each hall provides ten students and, as before, I have 200 students. The Genstat® null ANOVA now looks like this.

Analysis of variance

Source of variation                        d.f.

Hall stratum

Diet                                                   1

Residual                                        18

Hall.Student stratum                    180

Total                                             199

We can now see, even more clearly, that it is the hall stratum that provides the residual variation with which we can estimate the precision of the treatment estimate. Furthermore, whatever the contribution of studying students may be to making our experiment more precise, we cannot use the degrees of freedom within halls to estimate how precise it will be unless we can declare that the contribution of halls to the overall variance is zero. This is an important point to remember, because it is now time we considered the baseline weight.
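The degrees-of-freedom bookkeeping behind the two null ANOVAs above can be reproduced with a few lines of arithmetic. In this Python sketch (the function name and interface are mine, not Genstat®’s), the strata degrees of freedom for the block structure Hall/Student are computed with diet applied at the hall level.

```python
def null_anova_df(halls_per_diet, students_per_hall, diets=2):
    """Degrees of freedom for the block structure Hall/Student,
    with diet randomised at the hall level (one diet per hall)."""
    halls = diets * halls_per_diet
    students = halls * students_per_hall
    hall_stratum = halls - 1                   # between-hall d.f.
    diet_df = diets - 1                        # treatment d.f. (sits in hall stratum)
    residual = hall_stratum - diet_df          # what is left to estimate error
    within = halls * (students_per_hall - 1)   # Hall.Student stratum
    return {"Diet": diet_df, "Residual": residual,
            "Hall.Student": within, "Total": students - 1}

# Two halls of 100 students: no residual d.f. with which to judge precision.
print(null_anova_df(1, 100))
# {'Diet': 1, 'Residual': 0, 'Hall.Student': 198, 'Total': 199}

# Twenty halls of 10 students: 18 residual d.f. in the hall stratum.
print(null_anova_df(10, 10))
# {'Diet': 1, 'Residual': 18, 'Hall.Student': 180, 'Total': 199}
```

The 200 students contribute the same total degrees of freedom in both designs; only the second design leaves any in the stratum where the treatment sits.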

Suppose that we now stop to compare John’s and Jane’s estimates in terms of our improved experiment. Given that we have more information (if the variance between halls is important), we ought to expect the values to differ less. First note that only the term (Y_B – Y_A) can reflect the difference in diets, and that this term is the same in John’s and Jane’s estimates. Therefore, any convergence of John’s and Jane’s estimates is not because the term (Y_B – Y_A) will estimate the causal effect of diet better, although it may very reasonably be expected to do so, since that virtue is reflected in both their approaches. Note also that diet can only affect this term, since the term X_B – X_A involves values observed before the dietary intervention.

No. The reason that we may expect some convergence is that, although the correction term involving the baselines is not the same for both statisticians, for both it is a multiple of the same difference. For John we have (X_B – X_A) and for Jane we have r(X_B – X_A), and the difference between the two, (1 – r)(X_B – X_A), may be expected to get smaller as the number of halls increases. In fact, over all randomisations it is zero, and if we keep the number of students per hall constant but increase the number of halls it approaches zero, so that in some sense we can regard John and Jane as estimating the same effect marginally.
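The shrinkage of the disputed term can be checked by simulation. In the Python sketch below, all the variance components, the correlation, and the sample sizes are invented for illustration: baseline hall means are drawn at random, halls are randomised to the two diets, and the average size of (1 – r)(X_B – X_A) is compared for two halls versus twenty.

```python
import random

def mean_abs_correction(n_halls, n_reps=2000, r=0.6,
                        hall_sd=2.0, student_sd=5.0, students=10, rng=None):
    """Average |(1 - r)(X_B - X_A)| over repeated randomisations, where
    X_B - X_A is the baseline difference in group mean weight."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_reps):
        # Baseline mean weight of each hall: hall effect + averaged student noise.
        means = [rng.gauss(70.0, hall_sd) +
                 rng.gauss(0.0, student_sd / students ** 0.5)
                 for _ in range(n_halls)]
        rng.shuffle(means)          # randomise halls to the two diets
        half = n_halls // 2
        x_a = sum(means[:half]) / half
        x_b = sum(means[half:]) / half
        total += abs((1 - r) * (x_b - x_a))
    return total / n_reps

gap_2 = mean_abs_correction(2)      # two halls, one per diet
gap_20 = mean_abs_correction(20)    # ten halls per diet
print(gap_2, gap_20)                # the twenty-hall gap is much smaller
```

With ten halls per diet rather than one, the baseline correction term, and hence the disagreement between John and Jane, shrinks roughly by a factor of the square root of ten.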

Now, it is certainly my point of view(9) that we should not be satisfied with such marginal arguments, although they should always be considered because they are calibrating. The consequence of this is that although marginal inferences do not trump conditional ones, if you get your marginal inference wrong you will almost certainly get your conditional one wrong too. But suppose we have a large number of halls and notice some particular difference in weights at baseline between the two groups of halls, be it large or small. What should we do about it? It turns out that if we can condition on this difference appropriately we will have an estimate that is (a) independent of the observed difference and (b) efficient (that is to say, has a small variance). It is also known that a way to do this, which works asymptotically at least, is analysis of covariance. So we should adjust the difference in weights at outcome using the difference in weights at baseline in an analysis of covariance. Doesn’t this get us back to Jane’s solution?

Not quite. The relevant difference we would observe is the difference between the groups of halls. What Jane was proposing was to use the correlation within halls to correct a difference between halls. However, we have already seen that the Hall.Student stratum is not relevant for judging the variance of the outcomes. Can it automatically be right for judging the covariance? No. It might be an assumption one chooses to make, but it will be a choice, and it certainly cannot be said that this choice would be unambiguously right. If we just rely on the data, then Genstat® will have the baseline covariate entering in both strata, that is to say not only within halls but between them.
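The point that the within-hall slope need not be the right one for adjusting a between-hall difference can be made with a deliberately artificial example. In the Python sketch below (all numbers invented), the within-hall slope of outcome on baseline is 0.5, but the hall mean outcomes track the hall mean baselines with a slope of 1; adjusting the between-hall difference with the within slope then leaves an apparent effect that adjusting with the between slope removes.

```python
# Four hypothetical halls (two per diet). Each hall's mean outcome equals its
# mean baseline (between-hall slope 1), but within each hall the students'
# outcomes regress halfway towards the hall mean (within-hall slope 0.5).
def make_hall(m):
    baseline = [m - 10.0, m, m + 10.0]
    outcome = [m - 5.0, m, m + 5.0]   # within-hall slope = 0.5, mean = m
    return baseline, outcome

diet_a = [make_hall(70.0), make_hall(72.0)]
diet_b = [make_hall(75.0), make_hall(77.0)]

def mean(xs):
    return sum(xs) / len(xs)

def group_means(halls):
    return (mean([mean(b) for b, _ in halls]),
            mean([mean(o) for _, o in halls]))

xa, ya = group_means(diet_a)   # 71, 71
xb, yb = group_means(diet_b)   # 76, 76

within_slope = 0.5   # by construction
# Between-hall slope: regress hall mean outcomes on hall mean baselines.
all_x = [mean(b) for b, _ in diet_a + diet_b]
all_y = [mean(o) for _, o in diet_a + diet_b]
xbar, ybar = mean(all_x), mean(all_y)
between_slope = (sum((x - xbar) * (y - ybar) for x, y in zip(all_x, all_y))
                 / sum((x - xbar) ** 2 for x in all_x))

jane_within = (yb - ya) - within_slope * (xb - xa)   # 5 - 2.5 = 2.5
adj_between = (yb - ya) - between_slope * (xb - xa)  # 5 - 5.0 = 0.0
print(within_slope, between_slope, jane_within, adj_between)
```

Which adjusted estimate is right depends entirely on which source of variation is relevant, which is precisely the assumption that has to be made explicit.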

Thus, my conclusion is that Jane’s analysis could be right if the within-hall variances and covariances are relevant to the variation across halls. They might be relevant but it is far from obvious that they must be and it therefore does not follow that Jane’s argument is unambiguously right.

Of course, what I described was what one would decide for an experiment. You may choose to disagree that such a supposedly randomised experiment could provide any guidance for something quasi-experimental such as Lord’s paradox. After all, we are not told that the diets were randomised to the halls. I am not so sure. I think that in this case, at least, the quasi-experimental set-up inherits the problems that the similar randomised experiment would show. I think that it is far from obvious that what Jane proposes to do is unambiguously right.

Whether you agree or disagree, I hope I have succeeded in showing you that statistical theory, and in particular careful examination of variation, a topic initiated by RA Fisher one hundred years ago(10) and for which he proposed the squared measure and gave it the name variance, goes beyond merely warning that correlation is not causation. Sometimes correlation isn’t even correlation.

(Associated slides are below.)



  1. Lord FM. A paradox in the interpretation of group comparisons. Psychological Bulletin. 1967;66:304-5.
  2. Senn SJ. Change from baseline and analysis of covariance revisited. Statistics in Medicine. 2006;25(24):4334–44.
  3. Pearl J, Mackenzie D. The Book of Why: Basic Books; 2018.
  4. Van Breukelen GJ. ANCOVA versus change from baseline had more power in randomized studies and more bias in nonrandomized studies. Journal of Clinical Epidemiology. 2006;59(9):920-5.
  5. Wainer H, Brown LM. Two statistical paradoxes in the interpretation of group differences: Illustrated with medical school admission and licensing data. American Statistician. 2004;58(2):117-23.
  6. Holland PW, Rubin DB. On Lord’s Paradox. In: Wainer H, Messick S, editors. Principals of Modern Psychological Measurement. Hillsdale, NJ: Lawrence Erlbaum Associates; 1983.
  7. Nelder JA. The analysis of randomised experiments with orthogonal block structure I. Block structure and the null analysis of variance. Proceedings of the Royal Society of London Series A. 1965;283:147-62.
  8. Nelder JA. The analysis of randomised experiments with orthogonal block structure II. Treatment structure and the general analysis of variance. Proceedings of the Royal Society of London Series A. 1965;283:163-78.
  9. Senn SJ. Seven myths of randomisation in clinical trials. Statistics in Medicine. 2013;32(9):1439-50.
  10. Fisher RA. The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh. 1918;52:339-433.