Tools of the trade: Confusion matrix

[Header photo: Lake Superior]

This is the second post in a series of blog posts using a theme of “Tools of the trade”. The series targets software tools, statistical concepts, data science techniques, or related items. In all cases, the topic will contribute to accomplishing data science tasks. The target audience for the posts is engineers, analysts, and managers who want to build their knowledge and skills in data science.

Today’s topic: the confusion matrix

What is it?

The confusion matrix is a tool to help us understand the behavior of a classification machine learning (ML) model. The goal of a classification ML problem is to predict an output that takes one of two or more possible outcomes (or classes). For example, an ML classifier could predict whether someone has a disease or not; this is a binary classifier. A multi-class classifier predicts one of more than two outcomes.

After a supervised ML model is trained on a set of historical data, it is tested with data that was ‘held out’ of the training process. This lets us compare the trained model’s predictions with the actual values. The confusion matrix is an excellent way to evaluate how well a classification model succeeds and where it makes mistakes (gets confused).
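As a sketch of that hold-out workflow, the snippet below splits a dataset, trains a model, and builds a confusion matrix from the held-out portion. The features, labels, and model choice here are placeholders, not anything from the post:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical data: 200 rows, 4 numeric features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)

# Hold out 25% of the rows; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Compare held-out predictions with the actual labels
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```

Any classifier with a `fit`/`predict` interface could stand in for the random forest here.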

Let’s say we’re trying to predict if a pet is a dog or a cat based on some physical and behavioral attributes. Suppose we have a test dataset that has 30 dogs and 20 cats in it. A confusion matrix could look like the following:

[Image: confusion matrix for the dog/cat example]

The green numbers represent correct predictions. You can quickly see that the model predicted a higher percentage of the actual cats correctly. The overall accuracy of the model, 42/50 = 0.84, is also easy to calculate.
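In code, the matrix and the accuracy fall straight out of the actual and predicted labels. The per-class error counts below are one illustrative split consistent with the totals above (30 dogs, 20 cats, 42 correct), not necessarily the numbers in the figure:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Illustrative labels: 30 actual dogs, 20 actual cats, 42 correct overall
actual    = ["dog"] * 30 + ["cat"] * 20
predicted = ["dog"] * 25 + ["cat"] * 5 + ["cat"] * 17 + ["dog"] * 3

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(actual, predicted, labels=["dog", "cat"])
print(cm)

# Overall accuracy is the diagonal (correct predictions) over the total
acc = accuracy_score(actual, predicted)
print(acc)  # 42/50 = 0.84
```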

How do I use it?

Most discussion of the confusion matrix focuses on binary classifiers like the example above. This is a special case where several other metrics can be computed from the matrix, such as precision and recall.
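For a binary problem, precision and recall come directly from the matrix cells. Reusing the illustrative dog/cat labels from earlier (an assumed split, not the post's figure), and treating "dog" as the positive class:

```python
from sklearn.metrics import precision_score, recall_score

# Same illustrative labels as the dog/cat example above
actual    = ["dog"] * 30 + ["cat"] * 20
predicted = ["dog"] * 25 + ["cat"] * 5 + ["cat"] * 17 + ["dog"] * 3

# Precision: of everything predicted "dog", how much really was? TP/(TP+FP)
precision = precision_score(actual, predicted, pos_label="dog")  # 25/28

# Recall: of the actual dogs, how many did we catch? TP/(TP+FN)
recall = recall_score(actual, predicted, pos_label="dog")        # 25/30
print(precision, recall)
```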

The classification problem that I’ve been focused on recently is a finance scenario that has three states. The model predicts when a customer invoice is going to be paid: On time, Late, or Very late. In this case, let’s say that we have 100 test invoices with 50 that are actually paid on time, 35 that are actually late, and 15 that are actually very late. A model might have a confusion matrix like the following:
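The same `confusion_matrix` call handles the multi-class case; it just returns a 3x3 matrix. The predicted counts below are made up for illustration (only the row totals of 50/35/15 come from the scenario above):

```python
from sklearn.metrics import confusion_matrix

labels = ["On time", "Late", "Very late"]

# Row totals match the scenario; the cell values are illustrative only
actual = ["On time"] * 50 + ["Late"] * 35 + ["Very late"] * 15
predicted = (
    ["On time"] * 42 + ["Late"] * 8                            # actual On time
    + ["On time"] * 6 + ["Late"] * 26 + ["Very late"] * 3      # actual Late
    + ["Late"] * 5 + ["Very late"] * 10                        # actual Very late
)

# Passing labels fixes the row/column order to On time, Late, Very late
cm = confusion_matrix(actual, predicted, labels=labels)
print(cm)
```

Because the classes are ordinal, a matrix like this also shows how far off the wrong predictions were: most errors land in a neighboring class rather than jumping from On time to Very late.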

[Image: confusion matrix for the three-class invoice example]

Discussion

A confusion matrix provides significantly more information than a simple accuracy metric and yet is still easy to understand. It tells you whether you have a balanced dataset, where the output classes have similar counts. For the multi-class scenario, it also tells you how far off a prediction might be when the output classes are ordinal, as in the customer payment example.

Different accuracy metrics have the advantage of quantifying model quality in a single number. Future posts will discuss metrics used with classification problems.

Picture details:  Lake Superior, taken 8/5/2019, Canon PowerShot SD4000 IS, F/5.6, 1/100 s, ISO-1600, –0.7 step