(This article was originally published at Statistical Modeling, Causal Inference, and Social Science, and syndicated at StatsBlogs.)
We recently had an email discussion among the Stan team regarding the use of predictive accuracy in evaluating computing algorithms. I thought this could be of general interest so I’m sharing it here.
It started when Bob said he’d been at a meeting on probabilistic programming where there was confusion on evaluation. In particular, some of the people at the meeting had the naive view that you could just compare everything on cross-validated proportion-predicted-correct for binary data.
But this won’t work, for three reasons:
1. With binary data, cross-validation is noisy. Model B can be much better than model A but the difference might barely show up in the empirical cross-validation, even for a large data set. Wei Wang and I discuss that point in our article, Difficulty of selecting among multilevel models using predictive accuracy.
2. 0-1 loss is not in general a good measure. You can see this by supposing you’re predicting a rare disease. Upping the estimated probability from 1 in a million to 1 in a thousand will have zero effect on your 0-1 loss (your best point prediction is 0 in either case) but it can be a big real-world improvement.
3. And, of course, a corpus is just a corpus. What predicts well in one corpus might not generalize. That’s one reason we like to understand our predictive models if possible.
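Points 1 and 2 are easy to see in a quick simulation. Here's a minimal sketch (the rare-disease probabilities are the ones from point 2; the data are simulated, not from any real application): two models that make very different probability statements but identical point predictions, so 0-1 loss can't tell them apart while log loss can.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True event probability is rare: 1 in a thousand.
p_true = 1e-3
y = rng.binomial(1, p_true, size=n)

# Model A says 1 in a million; model B says the true 1 in a thousand.
p_a = np.full(n, 1e-6)
p_b = np.full(n, 1e-3)

def zero_one_loss(p, y):
    # Point prediction: predict 1 iff p > 0.5.
    return np.mean((p > 0.5) != y)

def log_loss(p, y):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Both models always predict 0, so their 0-1 losses are identical...
print(zero_one_loss(p_a, y), zero_one_loss(p_b, y))
# ...but model B's probabilities are far better calibrated,
# which log loss picks up.
print(log_loss(p_a, y), log_loss(p_b, y))
```

Model B's big improvement in probability (a factor of a thousand) is completely invisible to 0-1 loss, which is exactly the concern in point 2.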
Bob in particular felt strongly about point 1 above. He wrote:
Given that everyone (except maybe those SVM folks) is doing *probabilistic* programming, why not use log loss? That’s the metric that most of the Kaggle competitions moved to. It tests how well calibrated the probability statements of a model are, in a way that 0/1 loss, squared error, and ROC curve metrics like mean precision do not.
My own story dealing with this involved a machine learning researcher trying to predict industrial failures who built a logistic regression where the highest likelihood of a component failure was 0.2 or so. They were confused because the model didn’t seem to predict any failures at all, which seemed wrong. That’s just a failure to think in terms of expectations (20 parts with a 20% chance of failure each would lead to 4 expected failures). I also tried explaining that the model may be well calibrated and there may not be a part that has more than a 20% chance of failure. But they wound up doing what PPAML’s about to do for the image tagging task, namely compute some kind of ROC curve evaluation based on varying thresholds, which, of course, doesn’t measure how well calibrated the probabilities are, because it’s only sensitive to ranking.
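Bob's expectation arithmetic is worth spelling out. The numbers below come straight from his example (20 parts, each with a 20% failure probability); the independence assumption is mine, for illustration:

```python
# Hypothetical setup from Bob's story: 20 parts, each with
# a 20% chance of failing.
n_parts, p_fail = 20, 0.2

# Expected number of failures: n * p = 4, even though no single
# part is more likely to fail than not (every probability < 0.5),
# so the "best point prediction" for each part is "no failure."
expected_failures = n_parts * p_fail

# Assuming independent parts, the chance of at least one failure
# is nearly certain:
p_at_least_one = 1 - (1 - p_fail) ** n_parts

print(expected_failures, p_at_least_one)
```

So a model that never "predicts a failure" at the 0.5 threshold can still imply four expected failures per batch, which is exactly the point the researcher was missing.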
Tom Dietterich concurred:
Regarding holdout likelihood, yes, this is an excellent suggestion. We have evaluated on hold-out likelihood on some of our previous challenge problems. In CP6, we focused on the other metrics (mAP and balanced error rate) because that is what the competing “machine learning” methods employed.
Within the machine learning/computer vision/natural language processing communities, there is a widespread belief that fitting to optimize metrics related to the specific decision problem in the application is a superior approach. It would be interesting to study that question more deeply.
To which Bob elaborated:
I completely agree, which is why I don’t like things like mean average precision (MAP), balanced 0/1 loss, and balanced F measure, none of which relate to any relevant decision problem.
It’s also why I don’t like 0/1 loss (either straight up or through balanced F measures, macro-averaged F measure, etc.), because that’s never the operating point anyone wants. At least in 10 years working in industrial machine learning, it was never the decision problem anyone wanted. Customers almost always had asymmetric utility for false positives and false negatives (think epidemiology, suggesting search spelling corrections, speech recognition in an online dialogue system for airplane reservations, etc.) and wanted to operate at either very high precision (positive predictive accuracy) or very high recall (sensitivity). No customer or application I’ve ever seen other than writing NIPS or Computational Linguistics papers ever cared about balanced F measure in a large data set in an application.
The advantage of log loss is that it’s a better measure for generic decision making than area under the curve, because it measures how well calibrated the probabilistic inferences are. Well-calibrated inferences are optimal for all decision operating points, assuming you want to make Bayes-optimal decisions that maximize expected utility while minimizing risk. There’s a ton of theory around this, starting with Berger’s influential book on Bayesian decision theory from the 1980s. And it doesn’t just apply to Bayesian models, though almost everything in the machine learning world can be viewed as an approximate Bayesian technique.
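The "optimal for all operating points" claim falls out of a one-line expected-cost comparison. Here's a minimal sketch (the cost numbers are made up for illustration): given a calibrated probability p and asymmetric costs, the Bayes-optimal rule is just to act whenever the expected cost of acting is lower, which amounts to thresholding p at cost_fp / (cost_fp + cost_fn).

```python
def bayes_decision(p, cost_fp, cost_fn):
    """Predict 1 iff the expected cost of doing so is lower.

    Expected cost of predicting 1: (1 - p) * cost_fp
    Expected cost of predicting 0: p * cost_fn
    Equivalently: act iff p > cost_fp / (cost_fp + cost_fn).
    """
    return p * cost_fn > (1 - p) * cost_fp

# With symmetric costs, the threshold is the familiar 0.5...
print(bayes_decision(0.6, cost_fp=1, cost_fn=1))   # act
print(bayes_decision(0.4, cost_fp=1, cost_fn=1))   # don't act

# ...but if a miss is 99 times costlier than a false alarm
# (an epidemiology-style utility), the same calibrated model
# should act at any p above 0.01.
print(bayes_decision(0.02, cost_fp=1, cost_fn=99))  # act
```

The point is that one well-calibrated model serves every cost ratio; only the threshold changes, which is why calibration (and hence log loss) matters more than any single fixed operating point.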
Being Bayesian, the log loss isn’t a simple log likelihood with point estimated parameters plugged in (a popular approximate technique in the machine learning world), but a true posterior predictive estimate as I described in my paper. Of course, if your computing power isn’t up to it, you can approximate with point estimates and log loss by treating your posterior as a delta function around its mean (or even mode if you can’t even do variational inference).
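The gap between the posterior predictive and the plug-in approximation is easy to see numerically. In the sketch below, the posterior draws, covariate, and outcome are all made up for illustration; the point is only that averaging probabilities over the posterior is not the same as plugging the posterior mean into the inverse link, because the logistic function is nonlinear.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical posterior draws of a logistic-regression coefficient.
beta_draws = rng.normal(loc=1.0, scale=0.8, size=4000)

x = 2.0  # a test covariate value
y = 1    # the held-out binary outcome

# Posterior predictive probability: average the sigmoid over draws.
p_predictive = sigmoid(beta_draws * x).mean()

# Plug-in approximation: sigmoid at the posterior mean, i.e. treating
# the posterior as a delta function around its mean.
p_plugin = sigmoid(beta_draws.mean() * x)

# Log loss for this single held-out point under each approach.
ll_predictive = -(y * np.log(p_predictive) + (1 - y) * np.log(1 - p_predictive))
ll_plugin = -(y * np.log(p_plugin) + (1 - y) * np.log(1 - p_plugin))

print(p_predictive, p_plugin)
print(ll_predictive, ll_plugin)
```

With a tight posterior the two agree closely and the plug-in shortcut is harmless; with a wide posterior they can diverge, which is when the full posterior predictive is worth the extra computation.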
Sometimes ranking is enough of a proxy for decision making, which is why mean average precision (truncated to high precision, say average precision at 5) is relevant for some search apps, such as Google’s, and mean average precision (truncated to high recall) is relevant to other search apps, such as that of a biology post-doc or an intelligence analyst. I used to do a lot of work with DoD and DARPA and they were quite keen to have very very high recall — the intelligence analysts really didn’t like systems that had 90% recall so that 10% of the data were missed! At some points, I think they kept us in the evaluations because we provided an exact boolean search that had 100% recall, so they could look at the data, type in a phrase, and be guaranteed to find it. That doesn’t work with first-pass first-best analyses.
I suggested to Bob that he blog this but then we decided it would be more time-efficient for me to do it. The only thing is, then it won’t appear till October.
P.S. Here are Bob’s slides from that conference. He spoke on Stan.