A debate about effect-size variation in psychology: Simmons and Simonsohn; McShane, Böckenholt, and Hansen; Judd and Kenny; and Stanley and Doucouliagos

A couple of weeks ago, Uri Simonsohn and Joe Simmons sent me and others a note saying that they were writing a blog post citing some of our work and asking us to point out anything that we found “inaccurate, unfair, snarky, misleading, or in want of a change for any reason.”

I took a quick look and decided that my part in this was small enough that I didn’t really have anything to say. But some of my colleagues did have reactions, which they shared with the blog authors. Unfortunately, Simonsohn and Simmons did not want to post these replies on their blog or to link to them, so my colleagues asked me to post something here. That’s what I’m doing.

1. Post by Joe Simmons and Uri Simonsohn

This is the post that started it all, and it begins:

A number of authors have recently proposed that (i) psychological research is highly unpredictable, with identical studies obtaining surprisingly different results, (ii) the presence of heterogeneity decreases the replicability of psychological findings. In this post we provide evidence that contradicts both propositions.

Consider these quotes:

“heterogeneity persists, and to a reasonable degree, even in […] Many Labs projects […] where rigid, vetted protocols with identical study materials are followed […] heterogeneity […] cannot be avoided in psychological research—even if every effort is taken to eliminate it.”
McShane, Tackett, Böckenholt, and Gelman (American Statistician, 2019)

“Heterogeneity […] makes it unlikely that the typical psychological study can be closely replicated”
Stanley, Carter, and Doucouliagos (Psychological Bulletin, 2018)

“Repeated investigations of the same phenomenon [get] effect sizes that vary more than one would expect […] even in exact replication studies. […] In the presence of heterogeneity, […] even large N studies may find a result in the opposite direction from the original study. This makes us question the wisdom of placing a great deal of faith in a single replication study”
Judd and Kenny (Psychological Methods, 2019)

This post is not an evaluation of the totality of these three papers, but rather a specific evaluation of the claims in the quoted text. . .

2. Response by Blakeley McShane, Ulf Böckenholt, and Karsten Hansen

I wish Simmons and Simonsohn had just linked to this, but since they didn’t, here it is. And here’s the summary that McShane, Böckenholt, and Hansen wrote for me to post here:

We thank Joe and Uri for featuring our papers in their blogpost and Andrew for hosting a discussion of it. We keep our remarks brief here but note that (i) the longer comments that we sent Joe and Uri before their post went live are available here (they denied our request to link to this from their blogpost) and (ii) our “Large-Scale Replication” paper that discusses many of these issues in greater depth (especially on page 101) is available here.

A long tradition has argued that heterogeneity is unavoidable in psychological research. Joe and Uri seem to accept this reality when study stimuli are varied. However, they seem to categorically deny it when study stimuli are held constant but study contexts (e.g., labs in Many Labs, waves in their Maluma example) are varied. Their view seems both dogmatic and obviously false (e.g., should studies with stimuli featuring Michigan students yield the same results when conducted on Michigan versus Ohio State students? Should studies with English-language stimuli yield the same results when conducted on English speakers versus non-English speakers?). And, even in their own tightly-controlled Maluma example, the average difference across waves is ≈15% of the overall average effect size.

Further, the analyses Joe and Uri put forth in favor of their dogma are woefully unconvincing to all but true believers. Specifically, their analyses amount to (i) assuming or forcing homogeneity across contexts, (ii) employing techniques with weak ability to detect heterogeneity, and (iii) concluding in favor of homogeneity when the handicapped techniques fail to detect heterogeneity. This is not particularly persuasive, especially given that these detection issues are greatly exacerbated by the paucity of waves/labs in the Maluma, Many Labs, M-Turk, and RRR data and the sparsity in the Maluma data, which result in low power to detect, and imprecise estimates of, heterogeneity across contexts.

Joe and Uri also seem to misattribute to us the view that psychological research is in general “highly unpredictable” and that this makes replication hopeless or unlikely. To be clear, we along with many others believe exact replication is not possible in psychological research and therefore (by definition) some degree of heterogeneity is inevitable. Yet, we are entirely open to the idea that certain paradigms may evince low heterogeneity across stimuli, contexts, or both—perhaps even so low that one may ignore it without introducing much error (at least for some purposes if not all). However, it seems clearly fanatical to impose the view that heterogeneity is zero or negligible a priori. It cannot be blithely assumed away, and thus we have argued it is one of the many things that must be accounted for in study design and statistical analysis whether for replication or more broadly.

But, we would go further: heterogeneity is not a nuisance but something to embrace! We can learn much more about the world by using methods that assess and allow/account for heterogeneity. And, heterogeneity provides an opportunity to enrich theory because it can suggest the existence of unknown or unaccounted-for moderators.

It is obvious and uncontroversial that heterogeneity impacts replicability. The question is not whether but to what degree, and this will depend on how heterogeneity is measured, its extent, and how replicability is operationalized in terms of study design, statistical findings, etc. A serious and scholarly attempt to investigate this is both welcome and necessary!
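
One point in the response above, that standard techniques have weak power to detect heterogeneity when there are only a handful of waves or labs, can be illustrated with a quick simulation. The sketch below is not from McShane, Böckenholt, and Hansen; the number of labs, the within-lab standard error, and the between-lab standard deviation are assumptions chosen purely for illustration of how often Cochran’s Q test of homogeneity misses heterogeneity that is present by construction.

```python
# Illustrative sketch (not from the response above): power of Cochran's Q test
# to detect moderate between-lab heterogeneity with only a few labs.
# K, se, tau, and mu are assumptions for illustration, not estimates from any dataset.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

K = 8          # number of labs/waves (few, as in the settings discussed above)
se = 0.15      # assumed within-lab standard error of each effect estimate (SMD scale)
tau = 0.10     # assumed between-lab standard deviation of true effects
mu = 0.30      # assumed average true effect
alpha = 0.05
n_sims = 10_000

crit = chi2.ppf(1 - alpha, K - 1)                # critical value for the homogeneity test
rejections = 0
for _ in range(n_sims):
    true_effects = rng.normal(mu, tau, size=K)   # lab-specific true effects
    d = rng.normal(true_effects, se)             # observed effect estimates
    w = np.full(K, 1.0 / se**2)                  # inverse-variance weights
    d_bar = np.sum(w * d) / np.sum(w)            # fixed-effect pooled estimate
    Q = np.sum(w * (d - d_bar) ** 2)             # Cochran's Q statistic
    if Q > crit:
        rejections += 1

print(f"Power to detect tau = {tau}: {rejections / n_sims:.2f}")
# Under these assumed settings the power comes out around 0.2, so the test
# usually fails to flag heterogeneity even though it is genuinely there.
```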

3. Response by Charles Judd and David Kenny

Judd and Kenny’s response goes as follows:

Joe Simmons and Uri Simonsohn attribute to us (Kenny & Judd, 2019) a version of effect size heterogeneity that we are not sure we recognize. This is largely because the empirical results that they show seem to us perfectly consistent with the model of heterogeneity that we thought we had proposed. In the following we try to clearly say what our heterogeneity model really is and how Joe and Uri’s data seem to us consistent with that model.

Our model posits that an effect size from any given study, 𝑑_i, estimates some true effect size, 𝛿_i, and that these true effect sizes have some variation, 𝜎_𝛿, around their mean, 𝜇_𝛿. What might be responsible for this variation (i.e., the heterogeneity of true effect sizes)? There are many potential factors, but certainly among such factors are procedural variations of the sort that Joe and Uri include in the studies they report.

In the series of studies Joe and Uri conducted, participants are shown two shapes, one more rounded and one more jagged. Participants are then given two names, one male and one female, and asked which name is more likely to go with which shape. Across studies, different pairs of male and female names are used, but always with the same two shapes.

What Joe and Uri report is that across all studies there is an average effect (with the female name of the pair being seen as more likely for the rounded shape), but that the effect sizes in the individual studies vary considerably depending on which name pair is used in any particular study. For instance, when the name pair consists of Sophia and Jack, the effect is substantially larger than when the name pair consists of Liz and Luca.

Joe and Uri then replicate these studies a second time and show that the variation in the effect sizes across the different name-pairs is quite replicable, yielding a very substantial correlation of the effect sizes between the two replications, computed across the different name-pairs.

We believe that our model of heterogeneity can fully account for these results. The individual name-pairs each have a true effect size associated with them, 𝛿_i, and these vary around their grand mean 𝜇_𝛿. Different name-pairs produce heterogeneity of effect sizes. Name-pairs constitute a random factor that moderates the effect sizes obtained. It most properly ought to be incorporated into a single analysis of all the obtained data, across all the studies they report, treating it and participants as factors that induce random variation in the effect of interest (Judd, Kenny, & Westfall, 2012; 2017). . . .

The point is that there are potentially a very large number of random factors that may moderate effect sizes and that may vary from replication attempt to replication attempt. In Joe and Uri’s work, these other random factors didn’t vary, but that’s usually not the case when one decides to replicate someone else’s effect. Sample selection methods vary, stimuli vary in subtle ways, lighting varies, external conditions and participant motivation vary, experimenters vary, etc. The full list of potential moderators is long and perhaps ultimately unknowable. And heterogeneity is likely to ensue. . . .
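
To make the model in Judd and Kenny’s response concrete, one can write it as a two-level normal model (this is a formalization of their quoted description, not an excerpt from their paper). With se_i denoting the sampling standard error of study i’s estimate:

d_i | δ_i ~ N(δ_i, se_i²),  δ_i ~ N(μ_δ, σ_δ²).

One consequence connects directly to the quote in section 1 about large-N studies finding results in the opposite direction: even as se_i shrinks to zero, the probability that a new study’s true effect has the opposite sign from μ_δ is Φ(−μ_δ/σ_δ). That probability does not decrease with sample size; it is, for example, about 9% when σ_δ is three-quarters the size of μ_δ (an illustrative ratio, not an estimate from their data).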

4. Response by T. D. Stanley and Chris Doucouliagos

And here’s what Stanley and Doucouliagos write:

Last Fall, MAER-Net (Meta-Analysis of Economics Research-Network) had a productive discussion about the replication ‘crisis,’ and how it could be turned into a credibility revolution. We examined the high heterogeneity revealed by our survey of over 12,000 psychological studies and how it implies that close replication is unlikely (Stanley et al., 2018). Marcel van Assen pointed out that the then recently-released, large-scale, multi-lab replication project, Many Labs 2 (Klein et al., 2018), “hardly show heterogeneity,” and Marcel claimed “it is a myth (and mystery) why researchers believe heterogeneity is omnipresent in psychology.”

Supporting Marcel’s view is the recent post by Joe Simmons and Uri Simonsohn about a series of experiments that are directly replicated a second time using the same research protocols. They find high heterogeneity across versions of the experiment (I^2 = 79%), but little heterogeneity across replications of the exact same experiment.

We accept that carefully-conducted, exact replications of psychological experiments can produce reliable findings with little heterogeneity (MAER-Net). However, contrary to Joe and Uri’s blog, such modest heterogeneity from exactly replicated experiments is fully consistent with the high heterogeneity that our survey of 200 psychology meta-analyses finds and its implication that “it (remains) unlikely that the typical psychological study can be closely replicated” . . .

Because Joe and Uri’s blog was not pre-registered and concerns only one idiosyncratic experiment at one lab, we focus instead on ML2’s pre-registered, large-scale replication of 28 experiments across 125 sites, addressing the same issue and producing the same general result. . . . ML2 focuses on measuring the “variation in effect magnitudes across samples and settings” (Klein et al., 2018, p. 446). Each ML2 experiment is repeated at many labs using the same methods and protocols established in consultation with the original authors. After such careful and exact replication, ML2 finds that only a small amount of heterogeneity remains across labs and settings. It seems that psychological phenomena and the methods used to study them are sufficiently reliable to produce stable and reproducible findings. Great news for psychology! But this fact does not conflict with our survey of 200 meta-analyses nor its implications about replications (Stanley et al., 2018).

In fact, ML2’s findings corroborate both the high heterogeneity our survey finds and its implication that typical studies are unlikely to be closely replicated by others. Both high and little heterogeneity at the same time? What explains this heterogeneity in heterogeneity?

First, our survey finds that typical heterogeneity in an area of research is 3 times larger than sampling error (I^2 = 74%; std dev = .35 SMD). Stanley et al. (2018) shows that this high heterogeneity makes it unlikely that the typical study will be closely replicated (p. 1339), and ML2 confirms our prediction!

Yes, ML2 discovers little heterogeneity among different labs all running the exact same replication, but ML2 also finds huge differences between the original and replicated effect sizes . . . If we take the experiments that ML2 selected to replicate as ‘typical,’ then it is unlikely that this ‘typical’ experiment can be closely replicated. . . .

Heterogeneity may not be omnipresent, but it is frequently seen among published research results, identified in meta-analyses, and confirmed by large-scale replications. As Blakeley, Ulf, and Karsten remind us, heterogeneity has important theoretical implications, and it can also be identified and explained by meta-regression analysis.
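
A note on the arithmetic behind “3 times larger than sampling error” (this is a reading of the quoted numbers using the standard definition of I², not a calculation taken from their paper): if I² = τ²/(τ² + s²), where τ² is the between-study variance and s² is a typical within-study sampling variance, then I² = 0.74 implies

τ²/s² = I²/(1 − I²) = 0.74/0.26 ≈ 2.8,

that is, the heterogeneity variance is roughly three times the sampling variance, which is the ratio cited above.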

5. Making sense of it all

I’d like to thank all parties involved—Simmons and Simonsohn; McShane, Böckenholt, and Hansen; Judd and Kenny; and Stanley and Doucouliagos—for their contributions to the discussion.

On the substance of the matter, I agree with Judd and Kenny and with Stanley and Doucouliagos that effects do vary—indeed, just about every psychology experiment you’ll ever see is a study of a two-way or three-way interaction, hence varying effects are baked into the discussion—and it would be a mistake to treat the variation as zero just because it’s hard to detect in a particular dataset (echoes of the hot-hand fallacy fallacy!).

I’ve sounded the horn earlier on the statistical difficulties of estimating treatment effect variation, so I also see where Simmons and Simonsohn are coming from, pointing out that apparent variation can be much larger than underlying variation. Indeed, this relates to the justly celebrated work of Simmons, Nelson, and Simonsohn on researcher degrees of freedom and “false positive psychology”: The various overestimated effects in the Psychological Science / PNAS canon can be viewed as a big social experiment in which noisy data and noisy statistics were, until recently, taken as evidence that we live in a capricious social world, and that we’re buffeted by all sorts of large effects.

Simmons and Simonsohn’s effort to push back against overestimates of effect-size variability is, therefore, consistent with their earlier work pushing back against overestimates of effect sizes themselves. Remember: just about every effect being studied in psychology is an interaction. So if an effect size (e.g., the effect of ovulation on voting) was advertised as 20% but is really, say, 0.2%, then what is being scaled down is itself an effect-size heterogeneity.

I also like McShane, Böckenholt, and Hansen’s remark that “heterogeneity is not a nuisance but something to embrace.”

6. Summary

On the technical matter, I agree with the discussants that it’s a mistake to think of effect-size variation as negligible. Indeed, effect-size variation—also called “interactions”—is central to psychology research. At the same time, I respect Simmons and Simonsohn’s position that effect-size variation is not as large as has been claimed—a position related to the uncontroversial observation that Psychological Science and PNAS have published a lot of bad papers—and that overrating the importance of effect-size variation has led to lots of problems.

It’s too bad that Uri and Joe’s blog doesn’t have a comment section; fortunately we can have the discussion here. All are welcome to participate.