Lee Nguyen Tran Kim Song Shimazaki

January 6, 2013

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Andrew Lee writes:

I am a recent M.A. graduate in sociology. I am primarily qualitative in method but have been moving in a more mixed-methods direction ever since I discovered sports analytics (Moneyball, Football Outsiders, Wages of Wins, etc.).

For my thesis I studied Korean-Americans in education in the health professions through a comparison of Asian ethnic representation in Los Angeles-area medical and dental schools.

I did this by counting the different Asian ethnic groups at the medical and dental schools of UC Irvine, USC, and Loma Linda University, using surnames as an identifier (I coded for ethnicity using an algorithm from the North American Association of Central Cancer Registries that correlates surnames with ethnicity: http://www.naaccr.org/Research/DataAnalysisTools.aspx). The coding was mostly easy, since “Nguyen” and “Tran” are always Vietnamese, “Kim” and “Song” are Korean, “Shimazaki” is Japanese, etc.

Now, the first time around I found that Chinese-Americans and Vietnamese-Americans were proportionally the most numerous Asian ethnics at the medical/dental schools of UC Irvine and USC (each about 10% of graduating classes), while Korean-Americans were a distant third (3-5%). At Loma Linda University, however, Korean-Americans were about 30% of dental school graduating classes every year and 20% of medical school graduates. Chinese- and Vietnamese-Americans, meanwhile, were in the low single digits at Loma Linda (Japanese never exceed 2% at any school, strangely enough).

These results were surprising because I had expected all three schools’ Asian ethnic representation to mirror the region’s demographics. Thinking I had made a mistake, I decided to try recoding. I suspected that I might be subconsciously over-counting Korean-Americans by coding all instances of the “Lee” surname as Korean rather than Chinese. So I decided that unless an “ethnic” first name identified an individual as a particular ethnicity, I would label all ambiguous cases simply as Chinese.

So for example, a “Steve Lee” or “Jonathan Lee” would always be labeled as Chinese. I would only label a person with a “Lee” last name as Korean if their ethnic Korean given name was included as a marker (Korean and Chinese given names are quite distinct).

After the recoding, my results were almost the same, except that the percentage of Chinese at Loma Linda University’s medical school went up to almost 10%, and the proportions of Koreans at Loma Linda University went down to 25% in the dental school and 15% in the medical school.
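The two coding passes described above amount to a sensitivity check: code every ambiguous surname one way, then the other, and compare the resulting shares. A minimal sketch in Python, assuming a tiny hand-made surname table taken from the examples in the letter (not the NAACCR algorithm itself), a hypothetical table of given-name hints, and a made-up roster:

```python
from collections import Counter

# Surname examples quoted in the letter; a real coding would use a full reference list.
UNAMBIGUOUS = {
    "Nguyen": "Vietnamese", "Tran": "Vietnamese",
    "Kim": "Korean", "Song": "Korean", "Shimazaki": "Japanese",
}
AMBIGUOUS = {"Lee"}
# Hypothetical given names that resolve an ambiguous surname.
GIVEN_NAME_HINTS = {"Jihoon": "Korean", "Wei": "Chinese"}

def code_ethnicity(given, surname, ambiguous_default):
    """Assign one label per person; ambiguous surnames fall back to a default."""
    if surname in UNAMBIGUOUS:
        return UNAMBIGUOUS[surname]
    if surname in AMBIGUOUS:
        return GIVEN_NAME_HINTS.get(given, ambiguous_default)
    return "Other"

def shares(roster, ambiguous_default):
    """Share of the roster assigned to each label under one coding rule."""
    counts = Counter(code_ethnicity(g, s, ambiguous_default) for g, s in roster)
    return {group: n / len(roster) for group, n in counts.items()}

# Made-up roster, for illustration only.
roster = [
    ("Steve", "Lee"), ("Jonathan", "Lee"), ("Jihoon", "Lee"), ("Wei", "Lee"),
    ("Anna", "Kim"), ("Grace", "Song"), ("Linh", "Nguyen"), ("David", "Tran"),
    ("Ken", "Shimazaki"), ("Mary", "Smith"),
]

# Two extreme rules: every unresolved "Lee" is Chinese, or every one is Korean.
korean_low = shares(roster, "Chinese").get("Korean", 0.0)
korean_high = shares(roster, "Korean").get("Korean", 0.0)
print(f"Korean share is between {korean_low:.0%} and {korean_high:.0%}")
```

If the substantive conclusion holds at both ends of the interval, the ambiguity in the coding is not what drives it.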

Sorry for the long, boring paragraphs above on coding, but I just wanted to know if my methodology seems “rigorous” enough and if you would see these results as valid?

I want to make sure about these results, because they were the starting point from which I undertook a qualitative study of Asian-American medical/dental students at Loma Linda University to find out why the demographics at Loma Linda were so different. I found that Loma Linda University being a Seventh-Day Adventist university was the key deciding factor; the fact that Loma Linda University is a religious university encouraged self-selection among applicants as well as potential applicants to its medical/dental programs. Many non-Adventist potential applicants to Loma Linda’s programs decided against applying because they were turned off by the university’s overtly religious self-presentation; Korean-Americans are a highly Christian ethnic group, on the other hand, and so are well-represented within Adventism. The fact that Korean immigrants to America are also usually at least middle class and college-educated/professionally-trained means that they had advantages in academic achievement. So middle-class Korean-American Adventists are overwhelmingly choosing the medical/dental route as a way of ensuring financial stability, raising their family’s social status, and also signaling religious commitment (Adventism places a heavy emphasis on physical health in its religious doctrine, so the health professions enjoy an extra level of prestige).

My reply: Interesting. It reminds me of this article by Ron Unz. I think there must be an academic literature on ethnic coding of names. If you really want to learn some statistics you can fit a model in which the ethnic status of each person is a discrete latent variable, and then estimate using the EM algorithm. But to start I think it makes sense to do what you’re doing, trying out various extreme assumptions to bound your answer.
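One way to read the latent-variable suggestion is as a mixture model: each person's ethnicity is unobserved, the probability of each surname within each group is taken as known from a reference table, and EM estimates the group proportions at the school. A minimal sketch in Python under those assumptions; the surname probabilities and the roster below are invented for illustration and come from no real source:

```python
def em_group_proportions(surnames, likelihood, n_iter=200, tol=1e-10):
    """Estimate mixing proportions when P(surname | group) is assumed known.

    likelihood[group][surname] is the assumed P(surname | group).
    """
    groups = list(likelihood)
    pi = {g: 1.0 / len(groups) for g in groups}  # start from equal proportions
    for _ in range(n_iter):
        # E-step: posterior probability of each group for each person.
        totals = {g: 0.0 for g in groups}
        for s in surnames:
            weights = {g: pi[g] * likelihood[g].get(s, 0.0) for g in groups}
            norm = sum(weights.values())
            if norm == 0.0:
                continue  # surname not covered by any group; leave it out
            for g in groups:
                totals[g] += weights[g] / norm
        n_used = sum(totals.values())
        if n_used == 0.0:
            raise ValueError("no surname is covered by the likelihood table")
        # M-step: new proportions are the average posterior memberships.
        new_pi = {g: totals[g] / n_used for g in groups}
        change = max(abs(new_pi[g] - pi[g]) for g in groups)
        pi = new_pi
        if change < tol:
            break
    return pi

# Invented surname distributions within each group (each sums to 1).
likelihood = {
    "Korean": {"Kim": 0.5, "Lee": 0.4, "Song": 0.1},
    "Chinese": {"Lee": 0.3, "Wang": 0.4, "Chen": 0.3},
}
# Invented roster: only "Lee" is ambiguous between the two groups.
surnames = ["Kim"] * 5 + ["Lee"] * 6 + ["Wang"] * 2 + ["Chen"] * 2 + ["Song"]

pi_hat = em_group_proportions(surnames, likelihood)
print({g: round(p, 3) for g, p in pi_hat.items()})
```

The estimate necessarily falls between the two extreme codings (all ambiguous cases Chinese, all ambiguous cases Korean), so the bounding exercise and the model are consistent with each other; the model simply splits the ambiguous cases according to how common the surname is in each group and how common each group appears to be at the school.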
