(This article was originally published at Win-Vector Blog » Statistics, and syndicated at StatsBlogs.)

A bit more on the ROC/AUC

The receiver operating characteristic curve (or ROC) is one of the standard methods to evaluate a scoring system. Nina Zumel has described its application, but we would like to emphasize some additional details. In my opinion, while the ROC is a useful tool, the “area under the curve” (AUC) summary often read off it is not as intuitive and interpretable as one would hope, or as some writers assert.

The situation where a ROC analysis is useful is quite common. Typically it is a machine learning situation where you are training a classifier to decide between two categories. We will call the categories “true” and “false”. Many machine learning techniques (regression, Naive Bayes, logistic regression, SVM and so on) actually produce something more useful than a single classifier: they produce a scoring function where, if the classifier is working well, a higher score indicates a higher probability of an item belonging to the “true” class. To turn a scoring function into a classifier (or decision procedure) all that is needed is a cut-off score (called a threshold). All items receiving a score above the threshold are called “positive” (hopefully the predicted analogue of “true”) and all items receiving a score below the threshold are called “negative” (hopefully the predicted analogue of “false”). By picking different thresholds one can get different (though related) classifiers that represent different trade-offs between true_positive_rate (also called sensitivity) and false_positive_rate (also called fall-out; see ROC). Equivalently, we are only concerned with trade-offs among competing classifier performance metrics such as precision, recall and specificity.
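The score-plus-threshold idea can be sketched in a few lines. This is a minimal illustration, not from the article; the scores and labels are made-up example data.

```python
# Sketch: turning a scoring function into a classifier via a threshold.
# `scores` and `labels` are hypothetical example data.

def classify(scores, threshold):
    """Label every item scoring at or above the threshold as positive."""
    return [s >= threshold for s in scores]

scores = [0.2, 0.4, 0.6, 0.8]
labels = [False, False, True, True]  # the actual classes

# Different thresholds give different (though related) classifiers.
print(classify(scores, 0.5))  # [False, False, True, True]
print(classify(scores, 0.3))  # [False, True, True, True]
```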
To our mind the best graphical representation of the family of classifiers derivable from a single scoring function is the “double hump graph.” This is either a pair of density plots (where areas are re-normalized to integrate to one) or a pair of histograms (where area is proportional to the number of training examples) that simultaneously show the score distribution on true and false examples. If the two distributions are somewhat separated we get a family of thresholds that yield high-quality classifiers. If the two distributions are identical we have nothing.
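The histogram variant of the double-hump view can be sketched as follows. The two synthetic score distributions here are invented for illustration (well separated, so some threshold between the humps works).

```python
# Sketch: the "double hump" view -- per-class score histograms.
# The score distributions are synthetic, for illustration only.
import random

random.seed(0)
true_scores = [random.gauss(0.7, 0.1) for _ in range(1000)]
false_scores = [random.gauss(0.3, 0.1) for _ in range(1000)]

def histogram(scores, bins=10):
    """Count scores falling into equal-width bins on [0, 1]."""
    counts = [0] * bins
    for s in scores:
        i = min(max(int(s * bins), 0), bins - 1)  # clamp out-of-range scores
        counts[i] += 1
    return counts

# If the two histograms are well separated, thresholds between the humps
# yield high-quality classifiers; if they coincide, no threshold helps.
print(histogram(true_scores))
print(histogram(false_scores))
```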

An example of a double-hump graph (from “I don’t think that means what you think it means;” Statistics to English Translation, Part 1: Accuracy Measures) is given here:

From the same source the ROC curve is as follows:

The ROC curve’s strength and weakness is the observation that the classifier score is itself a nuisance parameter. Beyond its effects on classifier behavior we do not care about the actual values of the score. We only care about the different true_positive_rates and false_positive_rates that are simultaneously achievable, not what scores achieve them (that is an implementation detail). The ROC plot is exactly the plot of the efficient frontier of simultaneously achievable true_positive_rates and false_positive_rates. We will abbreviate these rates as tpr and fpr. The ROC plot can be produced as follows: for every possible threshold, compute the tpr and fpr induced by that threshold and add the point (fpr,tpr) to the curve. Notice: the threshold does not make it to the graph; it is lost in the parametric description of the ROC curve. What is missing is the speed at which the score parameter moves along the curve and the density of data associated with each score. These are evident in the double hump plots, where they are drawn as areas.
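The parametric construction just described can be sketched directly (this is an illustrative implementation, not a library call; the data is the toy set used later in this article):

```python
# Sketch of the parametric ROC construction: for every candidate threshold,
# compute (fpr, tpr) and collect the points.

def roc_points(scores, labels):
    """Return (fpr, tpr) pairs for thresholds at every observed score."""
    n_true = sum(labels)
    n_false = len(labels) - n_true
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / n_false, tp / n_true))
    return points

# The toy data set worked through later in this article:
scores = [0.1, 0.1, 0.1, 0.9]
labels = [False, False, True, True]
print(roc_points(scores, labels))  # [(0.0, 0.5), (1.0, 1.0)]
```

Note the thresholds themselves never appear in the output, which is exactly the point: the score is treated as a nuisance parameter.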

An issue is interpreting the ROC graph. I remember a presenter at KDD proudly showing a classifier whose curve nearly followed the diagonal line y=x. The presenter appeared blissfully unaware that random selection achieves this useless level of performance. Still, it is natural to talk about the area under the curve (AUC): it is what you see, so you should have some feel for it. The first thing to remember is that you get the first 1/2 unit of area for free (no credit to you there); it is only the second 1/2 that is at all meaningful. One should be suspicious of area on ROC graphs. Typically you are going to pick only one threshold and live with that classifier, so the perceived performance of classifiers you are not using is not entirely relevant. Also many scoring functions, for example those coming from Naive Bayes, are in fact highly concentrated and really only yield one usable classifier.

There is an extreme desire to bind AUC quickly to intuition. And this desire leads to a number of well-meaning, but ultimately incorrect heuristics.

For example, in “Why do Nigerian Scammers Say They are from Nigeria?” Cormac Herley states: “The Area Under the Curve (AUC) is the probability that the classifier ranks a randomly chosen true positive higher than a randomly chosen false positive.” This actually cannot be true just on the face of it. Which examples (and at what rate) are positive and negative (let alone true positive and false positive) is a function of which single threshold is chosen. So one of the two statements (the comparison rate of true positives to false positives) varies with the threshold and the other (the AUC) does not (as it is a graph over all thresholds). The two quantities cannot, in general, be the same.

That is perhaps a mangling of Wikipedia’s: “The area under curve (AUC), when using normalized units, is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming ‘positive’ ranks higher than ‘negative’).” We are assuming they are using positive/negative as we are using true/false (and will make the necessary translations). Wikipedia in turn attributes this to Fawcett, Tom (2006) “An introduction to ROC analysis”, Pattern Recognition Letters, 27, 861–874. The paper is behind the usual inexcusable ACM paywall, so we are not going to go to the original reference to confirm the source of this misleading statement. We cannot, of course, presume to criticize the calculations and lessons of “An introduction to ROC analysis”, but we can criticize the implication that area has an obvious interpretation (one not needing to be tutored in).

We will work an example to illustrate why you should not trust a statement of this form. Consider the following trivial data set:

datum number | classifier score | actual class |
---|---|---|
1 | 0.1 | false |
2 | 0.1 | false |
3 | 0.1 | true |
4 | 0.9 | true |
The only interesting score thresholds are score≥0.1 and score≥0.9. Recall tpr = true_positives/total_true_examples and fpr = false_positives/total_false_examples. So:

- At score≥0.1 we have tpr = 2/2 and fpr = 2/2
- At score≥0.9 we have tpr = 1/2 and fpr = 0/2

So the AUC is 3/4. But the probability that a random true example scores above a random false example is 1/2, as all false examples have score 0.1 and only half the true examples score above 0.1. The two numbers do not agree when we compute expectations with respect to the training data (the natural or empirical way to estimate this value). But that is ignoring the weasel-words “normalized units” from the claim we are disputing.
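The disagreement on the toy data set can be checked by direct computation; this sketch uses the strict-inequality rank probability as in the text, and the trapezoid rule for the area.

```python
# Check both quantities on the toy data set above.
scores = [0.1, 0.1, 0.1, 0.9]
labels = [False, False, True, True]

true_scores = [s for s, y in zip(scores, labels) if y]
false_scores = [s for s, y in zip(scores, labels) if not y]

# Empirical P(score(true) > score(false)) over all (true, false) pairs.
pairs = [(a, b) for a in true_scores for b in false_scores]
p_rank = sum(1 for a, b in pairs if a > b) / len(pairs)

# Trapezoidal area under the ROC points (0,0), (0,0.5), (1,1).
roc = [(0.0, 0.0), (0.0, 0.5), (1.0, 1.0)]
auc = sum((f2 - f1) * (t1 + t2) / 2 for (f1, t1), (f2, t2) in zip(roc, roc[1:]))

print(p_rank)  # 0.5
print(auc)     # 0.75
```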

Working from “the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one” which is:

(sum_{a:true} sum_{b:false} if(score(a)>score(b))) / (sum_{a:true} sum_{b:false} 1)

(where if(x) is 1 if x is true and 0 otherwise) we can get to the equivalent expression:

(sum_{a:true} (1 – fpr(score(a))) (sum_{b:false} 1)) / (sum_{a:true} sum_{b:false} 1)

(where fpr(score(a)) is the false positive rate of the classifier using a threshold of score(a)).

This reduces to:

(sum_{a:true} (1 – fpr(score(a)))) / (sum_{a:true} 1)

which is exactly (by definition) the expected value of 1 – fpr(score(a)) where a is drawn uniformly from the true training examples. If we had an additional “normalized units” assumption, such as tpr(score(a)) being uniformly distributed from 0 to 1 as we draw true a’s uniformly from the training set, then this quantity would indeed equal the AUC (as it essentially numerically integrates this figure by adding up row-areas).
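This reduced expression is easy to evaluate on the toy data set; a sketch (strict-inequality convention, matching the rank-probability calculation above):

```python
# Evaluate E[1 - fpr(score(a))] over true examples a, on the toy data set.
scores = [0.1, 0.1, 0.1, 0.9]
labels = [False, False, True, True]

false_scores = [s for s, y in zip(scores, labels) if not y]

def fpr(threshold):
    """False positive rate of the classifier 'score >= threshold'."""
    return sum(1 for s in false_scores if s >= threshold) / len(false_scores)

true_scores = [s for s, y in zip(scores, labels) if y]
expected = sum(1 - fpr(s) for s in true_scores) / len(true_scores)
print(expected)  # 0.5 -- the rank probability, not the 0.75 AUC
```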

However, an actual practicing data scientist should never be comfortable assuming tpr(score(a)) is uniformly distributed from 0 to 1 when we draw true a’s uniformly from the training set. We most often see this for low-quality classifiers where both the true and false examples have a very wide range of scores that do not separate the training classes (i.e. ROC curves trapped near the y=x diagonal). For ROC curves with larger area it has been my experience that the score tends to be concentrated at a few values. That is, unfortunately, more like our toy problem (with only two possible scores) than the uniform case. Most often tpr(score(a)) (where a is drawn uniformly from the true training examples) is concentrated (in the sense of the law of large numbers) at only two scores: a high score and a low score (this is true for both tpr(score(a)) and fpr(score(a)), for true examples, false examples and all examples).

This (undesirable) effect is most obvious for the Naive Bayes scoring function working over a very large set of features (as is common in text classification). In this case the logarithm of the Naive Bayes score is essentially the sum of a large number of indicators (each indicator describing the logarithm of the Naive Bayes impact of a specific feature being present or absent) and the score ends up concentrated. It ends up concentrated at two points (not one) based on the shared conditioning as to whether the example looks more like a true training example (high score, low fpr and tpr) or a false training example (low score, high fpr and tpr).

When the score is concentrated at two points the ROC curve is essentially two line segments: a segment from (0,0) to (f,t) and a segment from (f,t) to (1,1). Here (f,t) is the pair (fpr(s_h),tpr(s_h)) where s_h is the high concentration score. The curve we illustrated is “nearly” that shape if we take (f,t) = (0.9,0.9). In this cartoon case the AUC is 0.5*(1+t-f), and the probability a true training example scores higher than a false training example (both drawn uniformly from the training set, also stated as “from the empirical distribution”) is p_h*(1-fpr(s_h)) + (1-p_h)*(1-fpr(s_l)), where p_h is the probability a true training example gets a score near the high score s_h and s_l is the low score. This is approximately t*(1-f) (using the estimates p_h ~ t, fpr(s_h) = f and fpr(s_l) ~ 1; these are not relations that hold in general, but they will hold for data sets that are sufficiently concentrated near two scores). The second term depends on how often training examples are at different points on the ROC curve (which, as we mentioned earlier, is exactly what the ROC curve misses when it abstracts out the score parameter and data densities to present tpr as a function of fpr). As you would expect, the two equations (only valid for scores concentrated at two values), “AUC ~ 0.5*(1+t-f)” and “t*(1-f)”, are sufficiently distinct that they tend not to agree on actual data, and it is easy to build examples where they disagree. Notice that t*(1-f) does in fact have a quick interpretation as an area: it is the area of the rectangle that has (f,t) as its upper-left corner (a component of the AUC, but not the entire AUC).
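The two cartoon formulas are simple enough to compare directly; the sample (f,t) values below are made up for illustration.

```python
# Compare the two cartoon formulas for a score concentrated at two values.

def cartoon_auc(f, t):
    """Area under the two-segment ROC curve (0,0) -> (f,t) -> (1,1)."""
    return 0.5 * (1 + t - f)

def cartoon_rank_prob(f, t):
    """Approximate P(true outscores false) under the stated estimates."""
    return t * (1 - f)

for f, t in [(0.1, 0.9), (0.2, 0.8), (0.5, 0.5)]:
    print(f, t, cartoon_auc(f, t), cartoon_rank_prob(f, t))
# for (f,t)=(0.1,0.9): AUC is about 0.9 while the rank probability is about 0.81
```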

The ROC curve is a useful tool, but you have to use it for appropriate tasks. We know that when looking at it you can’t help but directly see the area (AUC). However, in our opinion, the AUC does not have as straightforward an intuition as we would hope. Indeed, over actual data usually only a few high-quality classifiers can be built by choosing a threshold (due to the score often being concentrated at a few values). This contradicts the expectation, formed from the diagram, that you have easy access to many classifiers with a wide range of dial-able behaviors. Without an explicit cost-of-error model (a cost of false positives and a separate cost of false negatives) you should always be suspicious of a single-number summary of classifier performance (be it accuracy, AUC, F1 or so on). We in fact prefer using both precision and recall. If you insist on a single number: the F1 is a good *heuristic* measure of classifier quality, as it at least incorporates our operational choice of score threshold into the quality assessment. The ROC curve is a useful tool for designing a classifier from a scoring function (though we prefer the “double hump graph”), but once you have chosen a threshold, the performance of the other classifiers (induced by choosing different thresholds) is irrelevant to assessing the performance of the classifier you have settled on.
