Controversies in the theory of measurement in mathematical psychology

We begin with this email from Guenter Trendler:

On your blog you wrote:

The replication crisis in social psychology (and science more generally) will not be solved by better statistics or by preregistered replications. It can only be solved by better measurement.

Check this out:

Measurement Theory, Psychology and the Revolution That Cannot Happen (pdf here)

The background is over 100 years of the theory and practice of measurement in psychology, which began (as I understand it, but bear in mind that I’ve never studied the history of these ideas) with the challenge of measuring subjective states. We can measure the length or weight of an object with a ruler or a scale, but how do you measure how loud a sound is, or how angry someone is, or how much something hurts? Or, to make things even more difficult, how do you measure someone’s verbal ability, their extraversion, their level of depression, or where they stand on some other scale of attitude or behavior? All these concepts are, to varying degrees, “real,” in the sense of being observable (even if only indirectly), reproducible, and corresponding to some external conditions, but they can’t be measured directly.

Much has been written about the challenges of indirect measurement in psychology, and many of the resulting ideas have come up again in other fields such as sociology, economics, and political science.

How do you measure “social class,” “race,” “economic growth,” or “political ideology,” for example? One must define as well as measure. Even something as simple as the price of some good or service in the marketplace will depend on how you define it.

All this is well understood within psychometrics, with stochastic models used to estimate—and, implicitly, to define—latent constructs of interest such as abilities, attitudes, and mental states.
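To give a sense of what I mean by a stochastic measurement model, here’s a toy sketch (my own, with made-up numbers; nothing here comes from the papers under discussion) of about the simplest such model, a one-parameter logistic (Rasch) model, in which a latent ability exists only as a parameter in a probability model for observed item responses:

```python
# Toy Rasch-model sketch: the latent "ability" is estimated (and, implicitly,
# defined) through a probability model for binary item responses.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(42)

n_items = 40
difficulty = rng.normal(0, 1, n_items)  # item difficulties (treated as known here)
true_ability = 0.7                      # the latent quantity we are trying to measure

# Simulate binary responses: Pr(correct) = inverse-logit(ability - difficulty)
p_correct = 1 / (1 + np.exp(-(true_ability - difficulty)))
responses = rng.binomial(1, p_correct)

# Estimate the latent ability by maximum likelihood, given the responses
def neg_log_lik(theta):
    p = 1 / (1 + np.exp(-(theta - difficulty)))
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

fit = minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded")
print(f"true ability: {true_ability:.2f}, estimated: {fit.x:.2f}")
```

The point is not the particular estimate but that the construct has no meaning outside the model: change the model and you change what “ability” is.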

But, outside of psychometrics, in certain areas of research psychology that make the Psychological Science / PNAS / TED talk / NPR circuit, the subtleties of measurement don’t seem so well understood.

There often seems to be the attitude that, to learn about the connection between latent characteristics A and B, any statistically significant correlation between observations x and y will do—as long as x can be considered in some way, however tenuous, to be a measurement of A, and y can be considered in some way to measure B. We’ve discussed lots and lots of such examples in this space, including fat arms, testosterone, power pose, life expectancy, and that study that labeled days 6-14 as the period of peak fertility. Many of the researchers in these studies didn’t seem to see the problem: they just (incorrectly) equated the measurements with the target of measurement and went from there.
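Here’s a quick simulation (mine; the noise levels are made up) of what goes wrong quantitatively: if x and y are noisy proxies for A and B, the observed correlation is the latent correlation attenuated by the square root of the product of the two reliabilities, Spearman’s old result:

```python
# Noisy proxies x, y for latent traits A, B: the observed correlation
# satisfies cor(x, y) = cor(A, B) * sqrt(rel_x * rel_y).
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
rho = 0.5                             # latent correlation between A and B
A = rng.normal(size=n)
B = rho * A + np.sqrt(1 - rho**2) * rng.normal(size=n)

sigma_x, sigma_y = 1.0, 2.0           # measurement-noise scales (hypothetical)
x = A + sigma_x * rng.normal(size=n)  # proxy for A
y = B + sigma_y * rng.normal(size=n)  # proxy for B

rel_x = 1 / (1 + sigma_x**2)          # reliability = var(signal) / var(total)
rel_y = 1 / (1 + sigma_y**2)
print("observed correlation:", np.corrcoef(x, y)[0, 1])
print("predicted by formula:", rho * np.sqrt(rel_x * rel_y))
```

With these made-up numbers, a true latent correlation of 0.5 shows up as an observed correlation of about 0.16, so a statistically significant correlation between the proxies tells you even less about A and B than it appears to.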

It’s not wrong to use proxy measurements—in many cases, including much of my own work, all we have are proxy measurements!—but you should be aware of the challenge of going from measurement to what you’re trying to measure. If you want to criticize my political science work on the grounds that you can’t trust people’s responses to pollsters, fine. To defend my work, I’ll have to directly address the problems of measurement, and of course political scientists have been studying such issues for decades.

OK, that’s all background. On to Trendler’s papers. I can’t follow exactly what he’s saying; but it’s not just him, it’s the whole literature: I’m just not familiar enough with the terminology and concepts used in the psychological theory of measurement.

That said, I think Trendler might be on to something here. I say this because of what I see as the weakness in the opposition to his arguments, as I discuss next.

Trendler recently published a new paper, Conjoint measurement undone, in the journal Theory & Psychology:

According to classical measurement theory, fundamental measurement necessarily requires the operation of concatenation qua physical addition. Quantities which do not allow this operation are measurable only indirectly by means of derived measurement. Since only extensive quantities sustain the operation of physical addition, measurement in psychology has been considered problematic. In contrast, the theory of conjoint measurement, as developed in representational measurement theory, proposes that the operation of ordering is sufficient for establishing fundamental measurement. The validity of this view is questioned. The misconception about the advantages of conjoint measurement, it is argued, results from the failure to notice that magnitudes of derived quantities cannot be determined directly, i.e., without the help of associated quantitative indicators. This takes away the advantages conjoint measurement has over derived measurement, making it practically useless.

This appeared with two discussions, one by Joel Michell and one by Dave Krantz and Tom Wallsten.

Michell’s comment was pretty technical and I did not try to follow it all. The whole topic just seems so slippery to me. Indeed, even the Wikipedia article on conjoint measurement was hard for me to follow. The topic may well be of fundamental importance, and maybe sometime in the future someone will sit down and explain it to me.
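For what it’s worth, here’s my attempt at a toy version of the one piece I think I do follow, the double cancellation condition (as stated in the Wikipedia article; I may be garbling the details). The idea, as I understand it, is that only order comparisons on the paired observations are used, and any table generated additively from two factors should pass the check:

```python
# Check double cancellation on a two-factor table P, using only the ordering
# of its entries: if P[a,y] >= P[b,x] and P[b,z] >= P[c,y], then we must
# have P[a,z] >= P[c,x].  An additive table always satisfies this.
import itertools
import numpy as np

phi = np.array([0.0, 1.3, 2.1])   # hypothetical values for the first factor
psi = np.array([0.0, 0.8, 1.9])   # hypothetical values for the second factor
P = phi[:, None] + psi[None, :]   # 3x3 additive table

def double_cancellation_holds(P):
    rows, cols = range(P.shape[0]), range(P.shape[1])
    for a, b, c in itertools.product(rows, repeat=3):
        for x, y, z in itertools.product(cols, repeat=3):
            if P[a, y] >= P[b, x] and P[b, z] >= P[c, y] and not P[a, z] >= P[c, x]:
                return False
    return True

print(double_cancellation_holds(P))          # True: additive structure passes
print(double_cancellation_holds(np.exp(P)))  # True: monotone transforms preserve the order
```

Note that the check never touches the numerical values directly, only their order, which I take to be the sense in which ordering is supposed to be “sufficient.”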

In their comment, Krantz and Wallsten made a larger statement about the replication crisis in psychology and elsewhere. They write:

Replication is abetted by statistical thinking, but not closely tied to it. It was important in science long before the burgeoning of statistics in the late 19th and the 20th century. . . . Roentgen’s discovery of X-rays used only an induction coil, a vacuum tube, cardboard for shielding, and a photographic plate; following his report (January 1, 1896) it was replicated within a month in many European and American laboratories (Pais, 1986, pp. 37–39). Tversky and Kahneman (1971) used a brief questionnaire and an available pool of human respondents to discover that subjective binomial sampling distributions do not vary with stated sample size. One of us replicated this using 50 students (in a graduate statistics class) within weeks after receiving their draft manuscript and we have both since replicated it several times in classroom settings. The culture of replication depends on feasibility, habit of mind, and typical sizes of reported effects. . . .

Excellent point. Replication is fundamental and in many cases does not need to be tied to statistics at all.

Krantz and Wallsten also write:

In fact, both false alarms and low-power misses are statistically inevitable, rather than signs of pathology. Failure to accept this probabilistic viewpoint can contribute to a (false) feeling of crisis, and thence to unreasonable remedies. . . .
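They’re right that, in the standard hypothesis-testing framework, both error rates are baked into the design. Here’s the arithmetic for a two-sided z-test, with illustrative numbers of my own choosing:

```python
# Error rates for a two-sided z-test of a mean, with known unit variance.
from scipy import stats

alpha, n, effect = 0.05, 50, 0.3   # hypothetical: 50 observations, 0.3 sd effect
z_crit = stats.norm.ppf(1 - alpha / 2)
se = 1 / n**0.5                    # standard error of the sample mean
shift = effect / se
power = (1 - stats.norm.cdf(z_crit - shift)) + stats.norm.cdf(-z_crit - shift)
print(f"false-alarm rate when the true effect is zero: {alpha:.2f}")
print(f"miss rate when the true effect is {effect} sd: {1 - power:.2f}")
```

So yes, some false alarms and misses are inevitable. The question is whether that observation explains away the replication problems we’ve been seeing, and I don’t think it does.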

They also write:

We are horrified by much of the statistical practice in psychology and other research. But so are many other critics. . . . Hardly anyone follows Trendler (2019; or Stevens, 1946) by asserting that development of interval-scale measurement is a prerequisite for statistical analysis. . . .

This does not fit together. Much of the statistical practice in psychology and other research is horrible? Check. Replication is important? Check. The feeling of crisis is “false”? Huh? A problem I see in Krantz and Wallsten’s discussion is that they talk about replication in terms of increasing sample sizes, without noting that improved measurement—better data—can be a key step. They write of “the inevitable tradeoff among effect size, sample size, and probability of missing something worthwhile”—but with better measurement we can learn more, with the new input being effort (to take better measurements) rather than sample size. Krantz and Wallsten come close when they write that “valid replication often requires theoretical understanding of the phenomenon in question,” but they don’t take the next step, which is that this theoretical understanding can facilitate, and also come from, deeper measurement. I know that these researchers understand this in their applied work, but in this discussion they don’t seem to make the connection.
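To see why I keep harping on measurement rather than sample size, consider a stylized calculation (my numbers, not theirs): the standard error of an estimated effect scales like sqrt(sigma_true^2 + sigma_meas^2) / sqrt(n), so when measurement noise dominates, reducing it can beat collecting more data:

```python
# Compare two ways of shrinking the standard error of an estimated effect:
# collecting more data vs. measuring each observation more precisely.
import numpy as np

sigma_true, sigma_meas, n = 1.0, 2.0, 100   # stylized values; noise dominates

def se(sigma_meas, n):
    return np.sqrt(sigma_true**2 + sigma_meas**2) / np.sqrt(n)

print("baseline:                   ", se(sigma_meas, n))      # ~0.224
print("double the sample size:     ", se(sigma_meas, 2 * n))  # ~0.158
print("halve the measurement noise:", se(sigma_meas / 2, n))  # ~0.141
```

With these numbers, halving the measurement noise beats doubling the sample size. The comparison reverses when sampling variation dominates, but that’s the point: whether more data or better data helps more is itself a question about measurement.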

In his response to the discussion, Trendler writes:

Unfortunately, the problem of measurability is not perceived as the primary cause of the failure to replicate, but what has been identified instead as the main issue is an inappropriate and dysfunctional use of established methods of statistical analysis. . . .

I agree with Trendler that improved statistical methods are not enough; we also need better measurement.

To summarize: I can’t evaluate the claims in the above discussion regarding conjoint measurement in psychology; the debate involves many technical details, and whenever I try to follow them I get tangled up. For me to understand this debate, I think it would help if it were applied to problems in political science such as the measurement of issue attitudes, political ideology, and partisanship, as estimated using survey responses, elections, votes, and political decisions. It does seem that much of the controversial work in psychology that I’ve seen has serious problems with measurement, a poor connection of measurement and experimental design to theory, and a general attitude that measurement doesn’t matter. So I do think that something needs to be done—and that “something” can’t just be increased sample size, exact replications of poorly conceived studies, and improved statistical analysis.