Tools of the trade: Log loss metric for classification ML

[Photo: winter sky]

This is the fifth post in a series of blog posts using a theme of “Tools of the trade”. The series targets software tools, statistical concepts, data science techniques, or related items. In all cases, the topic will contribute to accomplishing data science tasks. The target audience for the posts is engineers, analysts, and managers who want to build their knowledge and skills in data science.

This post follows up on the previous posts about classification metrics by discussing another, more advanced measure of classification ML performance – log loss.

What is it?

The last few Tools of the trade posts have focused on the predicted class for a classification problem by looking at the confusion matrix, accuracy, and the F1 score. Behind the scenes of that predicted class is a probability for each class. In the simplest case, the predicted class is simply the class with the highest probability. For binary classification problems, it is also common to choose a specific probability threshold, based on domain insight, to determine which class to predict.
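
To make that concrete, here's a minimal sketch (assuming a scikit-learn-style classifier; the data, model, and 0.35 threshold below are made up for illustration) of how the per-class probabilities relate to the predicted class:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical binary classification data and model
X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression().fit(X, y)

# Every prediction is backed by a probability for each class
proba = model.predict_proba(X[:5])                    # shape (5, 2)
highest_prob_class = proba.argmax(axis=1)             # simplest case: take the highest probability
threshold_class = (proba[:, 1] >= 0.35).astype(int)   # binary case: domain-driven threshold

print(proba)
print(highest_prob_class)
print(threshold_class)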

The underlying probability is often a more interesting data point than the predicted class. For example, in a customer payment prediction scenario with On time, Late, and Very late as the class states, the collection approach for an invoice predicted On time with a 35% probability should be different from one predicted On time with a 90% probability. A cash flow forecast based on these two scenarios should also differ significantly.

Log loss is a way to assess multiple models based on the underlying probabilities. The goal is to minimize the log loss value for the model. The formula looks a bit gnarly, but the intuition is quite straightforward, so I'll spend more time on the intuition than on the actual formula.

First we need to take a short trip back to high school mathematics class. The logarithmic function, or log, is the inverse of the exponent function. Using a base 10 log, the log of 100 is 2 (10^2 = 100). Likewise, the log of 0.01 is -2 (10^-2 = 0.01). Also noteworthy for this discussion is the fact that the log of 1 is 0 (10^0 = 1). Log loss calculations actually use the more common 'natural' log (base e), but the base 10 log is easier to explain, so we'll use it for the intuition.
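
For instance, a quick check of those log facts in Python:

import math

print(math.log10(100))    # 2.0, because 10^2 = 100
print(math.log10(0.01))   # -2.0, because 10^-2 = 0.01
print(math.log10(1))      # 0.0, because 10^0 = 1
print(math.log(0.9))      # math.log is the natural log (base e); about -0.105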

Log loss looks at the probability the model assigned to the actual class, which is a number between 0 and 1. The log of a number in this range is negative, so the sign of the calculation is flipped to produce a positive value to minimize. Note that the example values below use the natural log, as log loss implementations do. Here are three examples:

  • A confident correct prediction, such as 0.90, will have a small log value, for example, -(log(0.9)) = 0.1054.
  • A less confident correct prediction, such as 0.35, will have a larger log value – for example, -(log(0.35)) = 1.0498.
  • Things get even more interesting when the predicted class is incorrect. Let’s say an invoice in the test dataset has an actual outcome of Very late but the model predicted only a 5% likelihood of that happening. In that case, the log loss calculation results in -(log(0.05)) = 2.9957.

In summary:

  • We want to minimize the log loss for all predictions in the test dataset.
  • A confident correct prediction (high probability) will have a small log loss value.
  • A less confident correct prediction (medium probability) will have a medium log loss value.
  • An incorrect prediction – specifically, one where the model assigned a low probability to the actual class – will have a large log loss value.
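
Here is a minimal sketch of those three example calculations with NumPy (whose np.log is the natural log used above):

import numpy as np

# Probability the model assigned to the actual class in each of the three examples
probabilities = np.array([0.90, 0.35, 0.05])

# Per-prediction log loss: flip the sign so that smaller is better
print(-np.log(probabilities))   # about [0.1054, 1.0498, 2.9957]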

All of this leads to the following excerpt from https://scikit-learn.org/stable/modules/model_evaluation.html#log-loss.

This extends to the multiclass case as follows. Let the true labels for a set of samples be encoded as a 1-of-K binary indicator matrix $Y$, i.e., $y_{i,k} = 1$ if sample $i$ has label $k$ taken from a set of $K$ labels. Let $P$ be a matrix of probability estimates, with $p_{i,k} = \Pr(y_{i,k} = 1)$. Then the log loss of the whole set is

$$L_{\log}(Y, P) = -\log \Pr(Y \mid P) = -\frac{1}{N} \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} y_{i,k} \log p_{i,k}$$
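
As an illustrative sketch, scikit-learn's log_loss function computes this average; the invoice outcomes and probability estimates below are made up to mirror the three examples above:

from sklearn.metrics import log_loss

# Encode the classes as 0 = Late, 1 = On time, 2 = Very late;
# the columns of the probability matrix follow the same order
y_true = [1, 1, 2]            # actual outcomes for three invoices

y_proba = [
    [0.07, 0.90, 0.03],       # confident, correct On time prediction
    [0.34, 0.35, 0.31],       # less confident, correct On time prediction
    [0.60, 0.35, 0.05],       # confident but wrong: the actual outcome was Very late
]

# Average of -log(0.90), -log(0.35), and -log(0.05), about 1.38
print(log_loss(y_true, y_proba, labels=[0, 1, 2]))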

 

How do I use it?

Log loss is a common measurement for the competitions at https://www.kaggle.com. In particular, the NCAA March Madness basketball tournament competition uses it, which is where I was first exposed to log loss. To minimize log loss, the predictions you submit must confidently pick the game winners… keeping in mind that you will be penalized heavily when a confident prediction is wrong.

I came across a related concept in the book Business Data Science by Matt Taddy (https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778). He suggests using box plots to show the range of probabilities the model assigned to the actual class of each result. Here's an example of this for customer payment predictions. You can see that the median probability for each expected class is roughly the same, at around 0.5. Very late has a slightly higher median probability, but it also has a wider range of outcomes.
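
A rough sketch of building such a plot (the results table and column names below are hypothetical) could look like this:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical test results: one row per invoice, with the probability
# the model assigned to the invoice's actual class
results = pd.DataFrame({
    "actual_class": ["On time", "On time", "Late", "Late", "Very late", "Very late"],
    "prob_of_actual_class": [0.90, 0.35, 0.55, 0.48, 0.62, 0.20],
})

# Box plot of those probabilities, grouped by the actual class
results.boxplot(column="prob_of_actual_class", by="actual_class")
plt.ylabel("Predicted probability of the actual class")
plt.show()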

A third common use of log loss is model comparison when selecting the best candidate model.
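
For example, here is a sketch of that comparison using cross-validated log loss (the two candidate models and the synthetic data are placeholders; scikit-learn reports the negated log loss, so the sign is flipped back before printing):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic multiclass data standing in for the real training set
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")
    print(type(model).__name__, -scores.mean())   # the lower average log loss wins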

Discussion

While not as intuitive as a class-based measure like accuracy, log loss does a much better job of evaluating the probabilities underlying the predicted class in a classification ML problem. The box plot showing the probabilities for the expected class helps significantly with understanding a model's probability results.

References

https://datawookie.netlify.com/blog/2015/12/making-sense-of-logarithmic-loss/

https://en.wikipedia.org/wiki/Logarithm

https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778

Picture details: Winter sky, 2/24/20, iPhone 7s, f/1.8, 1/1842 s, ISO-20