It seemed to me that most destruction was being done by those who could not choose between the two

September 12, 2017

(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)

Amateurs, dilettantes, hacks, cowboys, clones — Nick Cave

[Note from Dan 11Sept: I wanted to leave some clear air after the StanCon reminder, so I scheduled this post for tomorrow. Which means you get two posts (one from me, one from Andrew) on this in two days. That’s probably more than the gay face study deserves.]

I mostly stay away from the marketing aspect of machine learning / data science / artificial intelligence. I’ve learnt a lot from the methodology and theory aspects, but the marketing face makes me feel a little ill.  (Not unlike the marketing side of statistics. It’s a method, not a morality.)

But occasionally stories of the id and idiocy of that world wander past my face and I can’t ignore it any more. Actually, it’s happened twice this week, which—in a week of two different orientations, two different next-door frat parties until two different two A.M.s, and two grant applications that aren’t as finished as I want them to be—seems like two times too many to me.

The first one can be dispatched quickly.  This idiocy came from IBM, who are discovering that deep learning will not cure cancer.  Given the way that governments, health systems, private companies, and research funding bodies are pouring money into the field, this type of snake oil pseudo-science is really quite alarming. Spectacularly well-funded hubris is still hubris and sucks money and oxygen from things that help more than the researcher’s CV (or bottom line).

In the end, the “marketing crisis” in data science is coming from the same place as the “replication crisis” in statistics.  The idea that you can reduce statistics and data analysis (or analytics) down to some small set of rules that can be automatically applied is more than silly, it’s dangerous. So just as the stats community has begun the job of clawing back the opportunities we’ve lost to p=0.05, it’s time for data scientists to control their tail.

Untouchable face

The second article occurred, as Marya Dmitriyevna (who is old-school) in Natasha, Pierre, and the Great Comet of 1812 shrieks after Anatole (who is hot) tries to elope with Natasha (who is young), in my house. So I’m going to spend most of this post on that. [It was pointed out to me that the first sentence is long and convoluted, even for my terrible writing. What can I say? It’s a complicated Russian novel, everyone’s got nine different names.]

(Side note: The only reason I go to New York is Broadway and cabaret. It has nothing to do with statistics and the unrepeatable set of people in one place. [please don’t tell Andrew. He thinks I like statistics.])

It turns out that deep neural networks join the illustrious ranks of “men driving past in cars”, “angry men in bars”, “men on the internet”,  and “that one woman at a numerical analysis conference in far north Queensland in 2006” in feeling obliged to tell me that I’m gay.

Now obviously I rolled my eyes so hard at this that I almost pulled a muscle.  I marvelled at the type of mind who would decide “is ‘gay face’ real” would be a good research question. I also really couldn’t understand why you’d bother training a neural network. That’s what instagram is for. (Again, amateurs, dilettantes, hacks, cowboys, clones.)

But of course, I’d only read the headline.  I didn’t really feel the urge to read more deeply.  I sent a rolled eyes emoji to the friend who thought I’d want to know about the article and went back about my life.

Once, twice, three times a lady

But last night, as the frat party next door tore through 2am (an unpleasant side-effect of faculty housing it seems) it popped up again on my facebook feed. This time it had grown an attachment to that IBM story.  I was pretty tired and thought “Right. I can blog on this”.

(Because that’s what Andrew was surely hoping for, three thousand words on whether “gay face” is a thing. In my defence, I did say “are you sure?” more than once and reminded him what happened that one time X let me post on his blog.)

(Side note: unlike odd, rambling posts about a paper we’d just written, posts about bad statistics published in psychology journals are right in Andrew’s wheelhouse.  So why aren’t I waiting until mid 2018 for a post on this that he may or may not have written to appear? [DS 11Sept: Ha!] There’s not much that I can be confident that I know a lot more about than Andrew does, but I am very confident I know a lot more about “gay face”.)

So I read the Guardian article, which (wonder of wonders!) actually linked to the original paper. Attached to the original paper was an authors’ note, and an authors’ response to the inevitable press release from GLAAD calling them irresponsible.

Shall we say that there’s some drift from the paper to the press reports. I’d be interested to find out if there was at any point a press release marketing the research.

The main part that has gone AWOL is a rather long, dystopian discussion that the authors have about how they felt morally obliged to publish this research because people could do this with “off the shelf” tools to find gay people in countries where it’s illegal.  (We will unpack that later.)

But the most interesting part is the tension between footnote 10 of the paper (“The results reported in this paper were shared, in advance, with several leading international LGBTQ organizations”) and the GLAAD/HRC press release that says

 Stanford University and the researchers hosted a call with GLAAD and HRC several months ago in which we raised these myriad concerns and warned against overinflating the results or the significance of them. There was no follow-up after the concerns were shared and none of these flaws have been addressed.

With one look

If you strip away all of the coverage, the paper itself does some things right. It has a detailed discussion of the limitations of the data and the method. (More later)  It argues that, because facial features can’t be taught, these findings provide evidence towards the prenatal hormone theory of sexual orientation (ie that we’re “born this way”).

(Side note: I’ve never liked the “born this way” narrative. I think it’s limiting.  Also, getting this way took quite a lot of work. Baby gays think they can love Patti and Bernadette at the same time. They have to learn that you need to love them with different parts of your brain or else they fight. My view is more “We’re recruiting. Apply within”.)

So do I think that this work should be summarily dismissed? Well, I have questions.

(Actually, I have some serious doubts, but I live in Canada now so I’m trying to be polite. Is it working?)

Behind the red door

Male facial image brightness correlates 0.19 with the probability of being gay, as estimated by the DNN-based classifier. While the brightness of the facial image might be driven by many factors, previous research found that testosterone stimulates melanocyte structure and function leading to a darker skin. (Footnote 6)

Again, it’s called instagram.

In the Authors’ notes attached to the paper, the authors recognise that “[gay men] take better pictures”, in the process acknowledging that they themselves have gay friends [honestly, there are too many links for that] who also possess this power. (Once more for those in the back: they’re called filters. Straight people could use them if they want. Let no photo go unmolested.)

(Side note: In 1989 Liza Minnelli released a fabulous collaboration with the Pet Shop Boys that was apparently called Results because that’s what Janet Street-Porter used to call one of her outfits. I do not claim to understand straight men [nor am I looking to], but I have seen my female friends’ tinder. Gentlemen: you could stand to be a little more Liza.)

Back on top

But enough about methodology, let’s talk about data. Probably the biggest criticism that I can make of this paper is that they do not identify the source of their data. (Actually, in the interview with The Economist that originally announced the study, Kosinski says that this is intentional to “discourage copycats”.)

Obviously this is bad science.

(Side note: I cannot imagine the dating site in question would like its users to know that it allows its data to be scraped and analysed. This is what happens when you don’t read the “terms of service”, which I imagine the authors (and the ethics committee at Stanford) read very closely.)

[I’m just assuming that, as a study that used identifiable information about people who could not consent to being studied and deals with a sensitive subject, this would’ve gone through an ethics process.]

This failure to disclose the origin of the data means we cannot contextualise it within the Balkanisation of gay desire. Gay men (at least, I will not speak for lesbians and I have nothing to say about bisexuals except “I thank you for your service”) will tell you what they want (what they really really want). This perverted and wonderful version of “The Secret” has manifested in multiple dating platforms that cater to narrow subgroups of the gay community.

Withholding information about the dating platform prevents people from independently scrutinising how representative the sample is likely to be. This is bad science.

(Side note: How was this not picked up by peer review? I’d guess it was reviewed by straight people. Socially active gay men should be able to pick a hole in the data in three seconds flat.  That’s the key point about diversity in STEM. It’s not about ticking boxes or meeting quotas. It’s that you don’t know what a minority can add to your work if they’re not in the room to add it.)

If you look at figure 4, the composite “straight” man is a trucker, while the composite “gay” man is a twunk. This strongly suggests that this is not my personal favourite type of gay dating site: the ones that cater to those of us who look like truckers. It also suggests that the training sample is not representative of the population the authors are generalising to. This is bad science.

(Terminology note: “Twunk” is the past form of “twink”, which is gay slang for a young (18-early20something), skinny, hairless gay man.)

The reality of gay dating websites is that you tailor your photographs to the target audience. My facebook profile picture (we will get to that later) is different to my grindr profile picture is different to my scruff profile picture. In the spirit of Liza, these are chosen for results. In these photos, I run a big part of the gauntlet between those two “composite ideals” in Figure 4. (Not all the way because, never having been a twink, I never twunk.)

(Side note: The last interaction I had on Scruff was a two hour conversation about Patti LuPone. This is an “off-label” usage that is not FDA approved.)

So probably my biggest problem with this study is that the training sample is likely unrepresentative of the population at large. This means that any inferences drawn from a model trained on this sample will be completely unable to answer questions about whether gay face is real in Caucasian Americans.  By withholding critical information about the data, the authors make it impossible to assess the extent of the problem.
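(Side note for people who prefer their arguments executable: here is a minimal sketch of the problem, using entirely simulated numbers. The "styling score", the group means, and the sample sizes are all made up for illustration and have nothing to do with the paper's actual data or classifier. The assumption is that styling is unrelated to the label in the wider population, but differs sharply between groups in the scraped sample because of how that sample was collected.)

```python
import random

rng = random.Random(0)

def draw(n, mean_group1, mean_group0):
    """Return n (styling_score, label) pairs with a label-dependent mean."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        mean = mean_group1 if label == 1 else mean_group0
        rows.append((rng.gauss(mean, 1.0), label))
    return rows

def fit_threshold(rows):
    """Toy classifier: cut halfway between the two group means."""
    ones = [x for x, y in rows if y == 1]
    zeros = [x for x, y in rows if y == 0]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2

def accuracy(rows, threshold):
    """Share of rows where 'score above threshold' matches the label."""
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)

# Hypothetical scraped sample: selection makes styling track the label.
scraped = draw(20000, mean_group1=2.0, mean_group0=0.0)
# Hypothetical wider population: styling carries no information at all.
wider = draw(20000, mean_group1=0.0, mean_group0=0.0)

threshold = fit_threshold(scraped)
acc_scraped = accuracy(scraped, threshold)
acc_wider = accuracy(wider, threshold)
print(f"accuracy on the scraped sample: {acc_scraped:.2f}")
print(f"accuracy on the wider sample:   {acc_wider:.2f}")
```

Under these assumptions the toy classifier looks very good on the sample it was built from and is no better than a coin flip on the wider one. It has learned the sampling scheme, not the people.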

One way to assess this error would be to take the classifier trained on their secret data and use it to, for example, classify face pics from a site like Scruff. There is a problem with this (as mentioned in the GLAAD/HRC press release) that activity and identity are not interchangeable. So some of the men who have sex with men (MSM, itself a somewhat controversial designation) will identify as neither gay nor bisexual and this is not necessarily information that would be available to the researcher. Nevertheless, it is probably safe to assume that people who have a publicly showing face picture on a dating app mainly used by MSMs are not straight.

If the classifier worked on this sort of data, then there is at least a chance that the findings of the study will replicate. But, those of you who read the paper or the Authors’ notes howl, the authors did test the classifier on a validation sample gathered from Facebook.

At this point, let us pause to look at stills from Derek Jarman films

Pictures of you

First, we used the Facebook Audience Insights platform to identify 50 Facebook Pages most popular among gay men, including Pages such as: “I love being Gay”, “Manhunt”, “Gay and Fabulous”, and “Gay Times Magazine”. Second, we used the “interested in” field of users’ Facebook profiles, which reveals the gender of the people that a given user is interested in. Males that indicated an interest in other males, and that liked at least two out of the predominantly gay Facebook Pages, were labeled as gay.

I beseech you, in the Bowels of Christ, think it possible that your validation sample may be biased.

(I mean, really. That’s one hell of an inclusion criterion.)
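(To make that concrete, a toy simulation. Everything here is hypothetical: the scores, the size of the real difference between groups, and above all the assumption that whatever gets someone through a strict "publicly signals membership" filter is correlated with whatever the classifier is scoring. It is a sketch of how an inclusion rule can flatter a validation metric, not a reconstruction of the paper's pipeline.)

```python
import random

rng = random.Random(1)

def auc(pos, neg, pairs=20000):
    """Estimate P(a random positive outscores a random negative)."""
    wins = sum(rng.choice(pos) > rng.choice(neg) for _ in range(pairs))
    return wins / pairs

# Hypothetical classifier scores with a weak real difference between groups.
group_b = [rng.gauss(0.0, 1.0) for _ in range(20000)]
group_a = [rng.gauss(0.5, 1.0) for _ in range(20000)]

# Assumed inclusion rule: a group A member only enters the validation set
# if a noisy "signals membership publicly" measure clears a high bar, and
# that measure is correlated with the classifier score.
included_a = [s for s in group_a if s + rng.gauss(0.0, 1.0) > 1.5]

auc_everyone = auc(group_a, group_b)
auc_filtered = auc(included_a, group_b)
print(f"AUC with all of group A:        {auc_everyone:.2f}")
print(f"AUC with filtered group A only: {auc_filtered:.2f}")
```

In this sketch the filtered validation set makes a modest classifier look like a very good one. Validating on the people who are easiest to label tells you how well you do on the people who are easiest to label.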

Rebel Girl / Nancy Boy

So what of the other GLAAD/HRC problems? They are mainly interesting to show the difference between the priorities of an advocacy organisation and statistical priorities. For example, the study only considered caucasians, which the press release criticises. The paper points out that there was not enough data to include people of colour. Without presuming to speak for LGB (the study didn’t consider trans people, so I’ve dropped the T+ letters [they also didn’t actively consider bisexuals, but LG sells washing machines]) people of colour, I can’t imagine that they’re disappointed to be left out of this specific narrative. That being said, the paper suggests that these results will generalise to other races. I am very skeptical of this claim.

Those composite faces also suggest that fat gay men don’t exist. Again, I am very happy to be excluded from this narrative.

What about the lesbians? Neural networks apparently struggle with lesbians. Again, it could be an issue of sampling bias. It could also be that the Mechanical Turk gender verification stage (4 of 6 people needed to agree with the person’s stated gender for them to be included) is adding additional bias.  The most reliable way to verify a person’s gender is to ask them.  I am uncertain why the researchers deviated from this principle. Certain sub-cultures in the LGB community identify as lesbian or gay but have either androgynous style, or purposely femme or butch style.  Systematically excluding these groups (I’m thinking particularly of butch lesbians and femme gays) will bias the results.
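(A back-of-the-envelope version of that filter. I am assuming the rule is "at least 4 of 6" and that raters judge independently, which is generous. The per-rater agreement rates of 0.9 and 0.6 are numbers I made up to stand in for "conventionally styled" and "less conventionally styled" photos; the paper does not report them.)

```python
from math import comb

def pass_probability(p_agree, raters=6, needed=4):
    """P(at least `needed` of `raters` independent raters agree)."""
    return sum(
        comb(raters, k) * p_agree**k * (1 - p_agree) ** (raters - k)
        for k in range(needed, raters + 1)
    )

# Hypothetical per-rater agreement rates for two styles of photo.
for p in (0.9, 0.6):
    print(p, round(pass_probability(p), 3))
# prints:
# 0.9 0.984
# 0.6 0.544
```

So under these made-up rates, nearly everyone in the first group survives the filter while almost half of the second group is thrown out before the classifier ever sees them.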

Breaking the law

So what of the dystopian picture the authors paint of governments using this type of procedure to find and eliminate homosexuals? Meh.

Just as I’m opposed to marketing in machine learning, I’m opposed to “story time” in otherwise serious research.  This claim makes the paper really exciting to read—you get to marvel at their moral dilemma. But the reality is much more boring. This is literally why universities (and I’m including Stanford in that category just to be nice) have ethics committees. I assume this study went through the internal ethics approval procedure, so the moralising is mostly pantomime.

The other reason I’m not enormously concerned about this is that I am skeptical of the idea that there is enough high-quality training data to train a neural network in countries where homosexuality is illegal (which are not white caucasian majority countries). Now I may well be very wrong about this, but I think we can all agree that it would be much harder than in the Caucasian case. LGBT+ people in these countries have serious problems, but neural networks are not one of them.

Researchers claiming that gay men, and to a slightly lesser extent lesbians, are structurally different from straight people is a problem. (I think we all know how phrenology was used to perpetuate racist myths.)  The authors acknowledge this during their tortured descriptions of their moral struggles over whether to publish the article. Given that, I would’ve expected the claims to be better validated. I just don’t believe that either their data set or their  validation set gives a valid representation of gay people.

All science should be good science, but controversial science should be unimpeachable. This study is not.

Live and let die (Geri Halliwell version)

Ok. That’s more than enough of my twaddle. I guess the key point of this is that marketing and storytelling, as well as being bad for science, get in the way of effectively disseminating information.

Deep learning / neural networks / AI are all perfectly effective tools that can help people solve problems. But just like it wasn’t a generic convolutional neural network that won games of Go, this tool only works when deployed by extremely skilled people. The idea that IBM could diagnose cancer with deep learning is just silly. IBM knows bugger all about cancer.

In real studies, selection of the data makes the difference between a useful step forward in knowledge and “junk science”.  I think there are more than enough problems with the “gay face” study (yes I will insist on calling it that) to be immensely skeptical. Further work may indicate that the authors’ preliminary results were correct, but I wouldn’t bet the farm on it. (I wouldn’t even bet a farm I did not own on it.) (In fact, if this study replicates I’ll buy someone a drink. [That someone will likely be me.])


Some things coming after the fact:
