# Dan’s Paper Corner: Yes! It does work!

Only share my research
With sick lab rats like me
Trapped behind the beakers
And the Erlenmeyer flasks
Cut off from the world, I may not ever get free
But I may
One day
Trying to find
An antidote for strychnine — The Mountain Goats

Hi everyone! Hope you’re enjoying Peak Libra Season! I’m bringing my Air Sign goodness to another edition of Dan’s Paper Corner, which is a corner that I have covered in papers I really like.

And honestly, this one is mostly cheating. Two reasons really. First, it says nice things about the work Yuling, Aki, Andrew, and I did and then proceeds to do something much better. And second because one of the authors is Tamara Broderick, who I really admire and who’s been on an absolute tear recently.

Tamara—often working with the fabulous Trevor Campbell (who has the good grace to be Canadian), the stunning Jonathan Huggins (who also might be Canadian? What am I? The national register of people who are Canadian?), and the unimpeachable Ryan Giordano (again. Canadian? Who could know?)—has written a pile of my absolute favourite recent papers on Bayesian modelling and Bayesian computation.

Here are some of my favourite topics:

As I say, Tamara and her team of grad students, postdocs, and co-authors have been on one hell of a run!

Which brings me to today’s paper: Practical Posterior Error Bounds from Variational Objectives by Jonathan Huggins, Mikołaj Kasprzak, Trevor Campbell, and Tamara Broderick.

In the grand tradition of Dan’s Paper Corner, I’m not going to say much about this paper except that it’s really nice and well worth reading if you care about asking “Yes, but did it work?” for variational inference.

I will say that this paper is amazing and covers a tonne of ground. It’s fully possible that someone reading this paper for the first time won’t recognize how unbelievably practical it is. It is not trying to convince you that its new melon baller will ball melons faster and smoother than your old melon baller. Instead it stakes out much bolder ground: this paper provides a rigorous and justified and practical workflow for using variational inference to solve a real statistical problem.

I have some approximately sequential comments below, but I cannot stress this enough: this is the best type of paper. I really like it. And while it may be of less general interest than last time’s general theory of scientific discovery, it is of enormous practical value. Hold this paper close to your hearts!

• On a personal note, they demonstrate that the idea in the paper Yuling, Aki, Andrew, and I wrote is good for telling when variational posteriors are bad, but the k-hat diagnostic being small does not necessarily mean that the variational posterior will be good. (And, tbh, that’s why we recommended polishing it with importance sampling)
• But that puts us in good company, because they show that neither the KL divergence that’s used in deriving the ELBO nor the Rényi divergence is a particularly good measure of the quality of the solution.
• The first of these is not all that surprising. I think it’s been long acknowledged that the KL divergence used to derive variational posteriors is the wrong way around!
• I do love the Wasserstein distance (or as an extremely pissy footnote in my copy of Bogachev’s glorious two volume treatise on measure theory insists: the Kantorovich-Rubinstein metric). It’s so strong. I think it does CrossFit. (Side note: I saw a fabulous version of A Streetcar Named Desire in Toronto [Runs til Oct 27] last week and really it must be so much easier to find decent Stanleys since CrossFit became a thing.)
• The Hellinger distance is strong too and will also control the moments (under some conditions. See Lemma 6.3.7 of Andrew Stuart’s encyclopedia)
• Reading the paper sequentially, I get to Lemma 4.2 and think “ooh. that could be very loose”. And then I get excited about minimizing over $\eta$ in Theorem 4.3 because I contain multitudes.
• Maybe my one point of slight disagreement with this paper is where they agree with our paper. Because, as I said, I contain multitudes. They point out that it’s useful to polish VI estimates with importance sampling, but argue that they can compute their estimate of VI error instead of k-hat. I’d argue that you need to compute both because just like we didn’t show that small k-hat guarantees a good variational posterior, they don’t show that a good approximate upper bound on the Wasserstein distance guarantees that importance sampling will work. So ha! (In particular, Chatterjee and Diaconis argue very strongly, as does MacKay in his book, that the variance of an importance sampler being finite is somewhere near meaningless as a practical guarantee that an importance sampler actually works in moderate to high dimensions.)
• But that is nought but a minor quibble, because I completely and absolutely agree with the workflow for Variational Inference that they propose in Section 4.3.
• Let’s not kid ourselves here. The technical tools in this paper are really nice.
• There is not a single example I hate more than the 8 schools problem. It is the MNIST of hierarchical modelling. Here’s hoping it doesn’t have any special features that make it a bad generic example of how things work!
• That said, it definitely shows that k-hat isn’t enough to guarantee good posterior behaviour.
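
If you want a feel for why I keep saying the Wasserstein distance is strong, here is a minimal toy sketch. It is not taken from the paper, and it is nothing like their actual bounds. It just checks, in one dimension and on made-up Gaussian draws, that the gap in means is at most $W_1$, that $W_1$ is at most $W_2$, and that the gap in standard deviations is at most $W_2$. The helper `wasserstein_1d` and the `target` and `approx` samples are hypothetical names I invented for illustration, and the sorted-matching trick only works for equal-sized samples on the real line.

```python
import random
import statistics


def wasserstein_1d(xs, ys, p=1):
    """p-Wasserstein distance between two equal-sized 1-D empirical samples.

    In one dimension the optimal coupling pairs up the sorted values,
    so no optimisation is needed. (Assumption: len(xs) == len(ys).)
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    total = sum(abs(x - y) ** p for x, y in zip(sorted(xs), sorted(ys)))
    return (total / len(xs)) ** (1.0 / p)


rng = random.Random(0)
n = 5000

# Made-up stand-ins: draws from an "exact posterior" and from a shifted,
# under-dispersed approximation (the classic VI failure mode).
target = [rng.gauss(0.0, 1.0) for _ in range(n)]
approx = [rng.gauss(0.3, 0.6) for _ in range(n)]

w1 = wasserstein_1d(target, approx, p=1)
w2 = wasserstein_1d(target, approx, p=2)

# Errors in the summaries people actually report.
mean_gap = abs(statistics.fmean(target) - statistics.fmean(approx))
sd_gap = abs(statistics.pstdev(target) - statistics.pstdev(approx))

print(f"W1 = {w1:.3f}, W2 = {w2:.3f}")
print(f"|mean gap| = {mean_gap:.3f}, |sd gap| = {sd_gap:.3f}")
```

The point of the toy is the direction of the inequalities: a small Wasserstein distance forces the mean and standard deviation to be close, which is exactly the kind of guarantee you do not get for free from a small KL divergence.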

Anyway. Here’s to more papers like this and to fewer examples of what the late, great David Berman referred to as “ceaseless feasts of schadenfreude”.