No; I was all horns and thorns
Sprung out fully formed, knock-kneed and upright
— Joanna Newsom
Far be it from me to be accused of liking things. Let me, instead, present a corner of my hateful heart. (That is to say that I’m supposed to be doing a really complicated thing right now and I don’t want to, so I’m going to scream into a void for a little while.)
The object of my ire: The 8-Schools problem.
Now, for those of you who aren’t familiar with the 8-schools problem, I suggest reading any paper by anyone who’s worked with Andrew (or has read BDA). It’s a classic.
So why hate on a classic?
Well, let me tell you. As you can well imagine, it’s because of a walrus.
I do not hate walruses (I only hate geese and alpacas: they both know what they did), but I do love metaphors. And while sometimes a walrus is just a walrus, in this case it definitely isn’t.
The walrus in question is the Horniman Walrus (please click the link to see my smooth boy!). The Horniman walrus is a mistake that you can see, for a modest fee, at a museum in South London.
The story goes like this: Back in the late 19th century someone killed a walrus, skinned it, hopefully did some other things to it, and sent it back to England to be stuffed and mounted. Now, it was the late 19th century and it turned out that the English taxidermist maybe didn’t know what a walrus looked like. (The museum’s website claims that “only a few people had ever seen a live walrus” at this point in history which, even for a museum, is really [expletive removed] white.)
But hey. He had sawdust. He had glue. He had the other things that are needed to stuff and mount a dead animal. So he took his dead animal and his tools, introduced them to each other, and proudly displayed the results.
(Are you seeing the metaphor?)
Now, of course, this didn’t go well. Walruses, if you’ve never seen one, are huge creatures with loose skin folds. The taxidermist did not know this, and so he stuffed the walrus full, leading to a photoshop disaster of a walrus. Smooth like a beachball. A glorious mistake. And a genuine tourist attraction.
So this is my first problem. Using a problem like 8 schools as a default test for algorithms has a tendency to lead to over-stuffed algorithms that are tailored to specific models. This is not a new problem. You could easily call it the NeurIPS Problem (aka how many more ways do you want to over-fit MNIST?). (Yes, I know NeurIPS has other problems as well. I’m focussing on this one.)
A different version of this problem is a complaint I remember from back in my former life, when I cared about supercomputers. This was before the whole “maybe you can use big computers on data” revolution. In those dark times, the benchmarks that mattered were the speed at which you could multiply two massive dense matrices and the speed at which you could do a dense LU decomposition of a massive matrix. Arguably, neither of these things was even then the key use of high-performance computers, but as the metrics became goals, supercomputer architectures emerged that could only be used to their full capacity on very specialized problems with enough arithmetic intensity to saturate the entire machine. (NB: This is quite possibly still true, although HPC has diversified beyond Cray-style architectures.)
So my problem, I guess, is with benchmark problems in general.
A few other specific things:
Why so small? 8 Schools has 8 observations, which is not very many observations. We have moved beyond the point where we need to include the data in a table in the paper.
Why so meta? The weirdest thing about the 8 Schools problem is that it has the form
$$y_j \mid \theta_j \sim N(\theta_j, \sigma_j^2), \qquad \theta_j \mid \mu, \tau \sim N(\mu, \tau^2), \qquad j = 1, \ldots, 8,$$
with appropriate priors on $\mu$ and $\tau$. The thing here is that the observation standard deviations $\sigma_j$ are known. Why? Because this is basically a meta-analysis. So 8-schools is a very specialized version of a Gaussian multilevel model. By fixing the observation standard deviations, the model has a much nicer posterior than the equivalent model with unknown observation standard deviations. Hence, 8-schools doesn’t even test an algorithm on an ordinary linear mixed model.
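For concreteness, here is a minimal sketch of that model in PyMC, with the Rubin (1981) estimates and standard errors. The specific priors on $\mu$ and $\tau$ are my own assumptions for illustration, since the problem statement just says “appropriate priors.”

```python
import numpy as np
import pymc as pm

# Rubin (1981) / BDA 8-schools data: estimated treatment effects and
# their standard errors, which the model treats as known constants.
y = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

with pm.Model() as eight_schools:
    # "Appropriate priors" on mu and tau; these particular choices
    # are assumptions for illustration, not part of the problem.
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    tau = pm.HalfCauchy("tau", beta=5.0)
    # Centered parameterization: this is the one with the funnel.
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=8)
    # The crucial bit: sigma enters as data, not as a parameter.
    pm.Normal("y_obs", mu=theta, sigma=sigma, observed=y)
    idata = pm.sample()
```

As written, with $\theta$ centered on $\mu$ and $\tau$, this is the parameterization that produces the famous funnel; in practice everyone immediately switches to the non-centered version, which is part of why the example ends up testing a reparameterization trick rather than the sampler.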
But it has a funnel! So does Radford Neal’s funnel distribution (in more than 17 dimensions). Sample from that instead.
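The funnel is also trivial to write down and to sample exactly, which is exactly what you want from a test problem: ground truth to compare your sampler against. A minimal numpy sketch (the 9-dimensional $x$ and the scale-3 prior on $v$ follow the usual convention from Neal (2003); adjust the dimension to taste):

```python
import numpy as np

rng = np.random.default_rng(0)

def funnel_logpdf(v, x):
    """Log density of Neal's funnel: v ~ N(0, 3^2), x_i | v ~ N(0, e^v)."""
    d = x.shape[-1]
    lp_v = -0.5 * (v / 3.0) ** 2 - np.log(3.0) - 0.5 * np.log(2.0 * np.pi)
    lp_x = -0.5 * np.sum(x * x, axis=-1) * np.exp(-v) \
           - 0.5 * d * (v + np.log(2.0 * np.pi))
    return lp_v + lp_x

def funnel_sample(n, d=9):
    """Exact draws, which is what makes the funnel a good test target."""
    v = rng.normal(0.0, 3.0, size=n)
    x = rng.normal(size=(n, d)) * np.exp(v / 2.0)[:, None]
    return v, x
```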
But it’s real data! Firstly, no, it isn’t. You grabbed it out of a book. Secondly, the idea that testing inference algorithms on real data is somehow better than systematically testing on simulated data is just wrong. We’re supposed to be statisticians, so let me ask you this: how does an algorithm’s success on real data set A generalize to the set of all possible data sets? (Hint: It doesn’t.)
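And “systematically testing on simulated data” is not hand-waving; there are concrete recipes, such as simulation-based calibration (Talts et al., 2018). The idea: draw a parameter from the prior, simulate data from it, run your sampler, and check that the true parameter’s rank among the posterior draws is uniform across many replications. A toy sketch in numpy, using a conjugate normal-mean model so the “sampler” under test can be exact (the model and all the constants here are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Model: mu ~ N(0, 1), y_1..y_n | mu ~ N(mu, 1). Conjugate, so the
# exact posterior is N(sum(y) / (n + 1), 1 / (n + 1)), and a correct
# "sampler" just draws from it.
n, L, n_trials = 10, 99, 2000
ranks = np.empty(n_trials, dtype=int)
for t in range(n_trials):
    mu_true = rng.normal(0.0, 1.0)                  # draw from the prior
    y = rng.normal(mu_true, 1.0, size=n)            # simulate data
    post_mean = y.sum() / (n + 1.0)                 # exact posterior
    post_sd = np.sqrt(1.0 / (n + 1.0))
    draws = rng.normal(post_mean, post_sd, size=L)  # "sampler" output
    ranks[t] = np.sum(draws < mu_true)              # rank statistic

# Under a correct sampler the ranks are uniform on {0, ..., L};
# a histogram of `ranks` that slopes or bulges flags miscalibration.
```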
So, in conclusion, I am really really really sick of seeing the 8-schools data set.