Do regression structures affect research capital? The case of pronoun drop

A linguist pointed me with incredulity to this article by Horst Feldmann, “Do Linguistic Structures Affect Human Capital? The Case of Pronoun Drop,” which begins:

This paper empirically studies the human capital effects of grammatical rules that permit speakers to drop a personal pronoun when used as a subject of a sentence. By de‐emphasizing the significance of the individual, such languages may perpetuate ancient values and norms that give primacy to the collective, inducing governments and families to invest relatively little in education because education usually increases the individual’s independence from both the state and the family and may thus reduce the individual’s commitment to these institutions. Carrying out both an individual‐level and a country‐level analysis, the paper indeed finds negative effects of pronoun‐drop languages. The individual‐level analysis uses data on 114,894 individuals from 75 countries over 1999‐2014. It establishes that speakers of such languages have a lower probability of having completed secondary or tertiary education, compared with speakers of languages that do not allow pronoun drop. The country‐level analysis uses data from 101 countries over 1972‐2012. Consistent with the individual‐level analysis, it finds that countries where the dominant languages permit pronoun drop have lower secondary school enrollment rates. In both cases, the magnitude of the effect is substantial, particularly among females.

Another linguist saw this paper and asked if it was a prank.

I don’t think it’s a prank. I think it’s serious.

It would be easy, and indeed reasonable, to just laugh at this one and move on, filing it alongside other cross-country comparisons of this sort, but I thought it could be instructive instead to take the paper seriously and see what went wrong.

I’m hoping the steps below can be useful to students when trying to understand published research. Or, for that matter, when trying to understand their own regressions.

So how can we figure out what’s really going on in this article?

To start with, the claimed effect is within-person (speaking a certain type of language affects your behavior) and within-country (speaking a certain type of language affects national values and norms), but all the data are observational and all the comparisons are between people and between countries. Thus, any causal interpretations are tenuous at best.

So we can start by rewriting the above abstract in descriptive terms. I’ll just repeat the empirical parts, swapping the causal wording (“effects”) for descriptive wording (“correlation,” “after adjusting for various demographic variables”):

This paper empirically studies the correlation of human capital with grammatical rules that permit speakers to drop a personal pronoun when used as a subject of a sentence. . . . Carrying out both an individual‐level and a country‐level analysis, the paper indeed finds negative correlations of pronoun‐drop languages with outcomes of interest after adjusting for various demographic variables. . . . speakers of such languages have a lower probability of having completed secondary or tertiary education, compared with speakers of languages that do not allow pronoun drop. The country‐level analysis uses data from 101 countries over 1972‐2012. Consistent with the individual‐level analysis, it finds that countries where the dominant languages permit pronoun drop have lower secondary school enrollment rates. In both cases, the magnitude of the correlation is substantial, particularly among females.

OK, that helps a little.

Now we have to dig in a bit more. First, what’s a pronoun-drop language? Or, more to the point, which languages have pronoun drop and which don’t? I looked through the paper for a list of these languages or a map of where they are spoken. I didn’t see such a list or map, so I went to Wikipedia and found this:

Among major languages, two of which might be called a pro-drop language are Japanese and Korean (featuring pronoun deletion not only for subjects, but for practically all grammatical contexts). Chinese, Slavic languages, and American Sign Language also exhibit frequent pro-drop features. In contrast, non-pro-drop is an areal feature of many northern European languages (see Standard Average European), including French, (standard) German, and English. . . . Most Romance languages (with the notable exception of French) are often categorised as pro-drop too, most of them only in the case of subject pronouns . . . Among the Indo-European and Dravidian languages of India, pro-drop is the general rule . . . Outside of northern Europe, most Niger–Congo languages, Khoisan languages of Southern Africa and Austronesian languages of the Western Pacific, pro-drop is the usual pattern in almost all linguistic regions of the world. . . . In many non-pro-drop Niger–Congo or Austronesian languages, like Igbo, Samoan and Fijian, however, subject pronouns do not occur in the same position as a nominal subject and are obligatory, even when the latter is present. . . .

Hmmmm, now things don’t seem so clear. Much will depend on how the languages are categorized.

The next thing we need, after we have a handle on the data, is a scatterplot. Actually a bunch of scatterplots. A scatterplot for each within-country analysis and a scatterplot for the between-country analysis. Outcome of interest on y-axis, predictor of interest on x-axis. OK, the within-country data will have to be plotted in a different way because the predictor and outcome are discrete, but something can be done there.
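For the discrete within-country data, one thing that can be done is to tabulate, country by country, the completion rate among pronoun-drop speakers minus the rate among everyone else, and then plot those gaps. Here is a minimal sketch of that tabulation. The records and the function name `completion_gaps` are made up for illustration; nothing here comes from the paper's data.

```python
from collections import defaultdict

# Hypothetical individual-level records:
# (country, speaks a pronoun-drop language, completed secondary education).
# These values are invented for illustration only.
records = [
    ("A", True, 1), ("A", True, 0), ("A", False, 1), ("A", False, 1),
    ("B", True, 0), ("B", True, 0), ("B", False, 1), ("B", False, 0),
    ("C", True, 1), ("C", True, 1), ("C", False, 1), ("C", False, 0),
]

def completion_gaps(records):
    """Per country: completion rate of pro-drop speakers minus rate of others."""
    # counts[country][pro_drop] = [number completed, number of people]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for country, pro_drop, completed in records:
        cell = counts[country][pro_drop]
        cell[0] += completed
        cell[1] += 1

    gaps = {}
    for country, cells in counts.items():
        # A country with only one language group offers no within-country comparison.
        if cells[True][1] == 0 or cells[False][1] == 0:
            continue
        rate_drop = cells[True][0] / cells[True][1]
        rate_other = cells[False][0] / cells[False][1]
        gaps[country] = rate_drop - rate_other
    return gaps

gaps = completion_gaps(records)
for country in sorted(gaps):
    print(country, round(gaps[country], 2))
```

In this toy example the gap is negative in two countries and positive in the third, which is exactly the kind of variation a single pooled regression coefficient hides. Countries where everybody falls in one language group drop out of the comparison entirely, and it is worth knowing how many of those there are.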

The point is, we need to see what’s going on. In the within-country analysis, where do we see this correlation and where do we not see it? In the between-country analysis, what countries are driving the correlation?
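One simple way to ask which countries are driving a between-country difference is to drop each country in turn and see how much the gap moves. The sketch below does this for a plain difference in means, which is a stand-in for the paper's regression, not a reproduction of it. The country labels, enrollment rates, and function names are all hypothetical.

```python
from statistics import mean

# Hypothetical country-level data:
# name -> (dominant language is pronoun-drop, secondary enrollment rate in %).
# These values are invented for illustration only.
countries = {
    "A": (True, 40.0), "B": (True, 80.0), "C": (True, 90.0),
    "D": (False, 85.0), "E": (False, 95.0),
}

def gap(data):
    """Mean enrollment in pro-drop countries minus mean in the others."""
    drop = [rate for is_drop, rate in data.values() if is_drop]
    other = [rate for is_drop, rate in data.values() if not is_drop]
    return mean(drop) - mean(other)

def leave_one_out(data):
    """Return the full-sample gap and how much it shifts when each country is removed.

    Assumes each language group has at least two countries, so that
    removing one never leaves a group empty.
    """
    full = gap(data)
    shifts = {}
    for name in data:
        rest = {k: v for k, v in data.items() if k != name}
        shifts[name] = gap(rest) - full
    return full, shifts

full_gap, shifts = leave_one_out(countries)
most_influential = max(shifts, key=lambda name: abs(shifts[name]))
print("full-sample gap:", full_gap)
print("most influential country:", most_influential, shifts[most_influential])
```

With these made-up numbers, removing country A shrinks the gap from 20 points to 5, so most of the apparent pattern rests on one country. A scatterplot with labeled points would show the same thing at a glance.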

Again, the analysis is all descriptive, and that’s fine, but the point is we need to understand what we’re describing.

I have no idea if the causal claims in this paper are true—given what I’ve seen so far, I see no particular reason to believe the claims. But, in any case, if these patterns are interesting—and I have no idea on that either—then they’re worth understanding. The regression won’t give us understanding; it just chews up the data and gives meaningless claims such as “we find that the magnitude of the effect is substantial and slightly larger for women. Specifically, women who speak a pronoun drop language are 9‐11 percentage points less likely to have completed secondary or tertiary education than women who speak a non‐pronoun drop language. For men, the probability is 8‐10 percentage points.” That way lies madness. We—Science—can do better.

P.S. I scrolled down to the end of the paper and found this sentence, which begins the final footnote:

Pronoun drop rules are not perfect measures of ancient collectivism.

Ya think? In all seriousness, who could think that pronoun drop rules are any sort of measure of “ancient collectivism” at all? As Bertrand Russell said, this is one of those views which are so absurd that only very learned men could possibly adopt them.