Tools of the trade: Classification ML accuracy

[Header photo: Lake Superior shoreline (picture details at the end of the post)]

This is the third post in a series of blog posts using a theme of “Tools of the trade”. The series targets software tools, statistical concepts, data science techniques, or related items. In all cases, the topic will contribute to accomplishing data science tasks. The target audience for the posts is engineers, analysts, and managers who want to build their knowledge and skills in data science.

This post follows up on the last post about the confusion matrix by discussing the most basic measure of classification ML performance: accuracy.

What is it?

The goal of a classification ML problem is to predict an output that takes one of two or more possible values (or classes). This post focuses on a multi-class classifier, which predicts one of more than two classes.

After a supervised ML model is trained on a set of historical data, it is tested with data that was ‘held out’ of the training process. This allows us to compare the predictions from the trained model with the actual values.
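Here’s a rough sketch of that train-and-validate workflow. The synthetic data, scikit-learn calls, and choice of model below are my own illustrative assumptions, not anything prescribed in this series:

    # Minimal sketch of the hold-out workflow, assuming scikit-learn.
    # The synthetic dataset and model choice are illustrative only.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic 3-class dataset standing in for historical data
    X, y = make_classification(n_samples=500, n_classes=3,
                               n_informative=5, random_state=42)

    # Hold out 20% of the data from the training process
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
    y_pred = model.predict(X_test)  # predictions to compare with the actual y_test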

Accuracy is simply the fraction of predictions that a model gets correct on the validation dataset. Using the second example from my last post, the accuracy is (40 + 25 + 8) / 100, or 0.73.

[Confusion matrix for model 1, from the previous post]
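Here’s a minimal sketch of that calculation in Python. Only the diagonal counts (40, 25, 8) and the total of 100 come from the confusion matrix above; the off-diagonal counts are placeholders I filled in for illustration:

    import numpy as np

    # Confusion matrix for model 1: rows = actual class, columns = predicted class.
    # The diagonal (40, 25, 8) and the total of 100 match the post; the
    # off-diagonal counts are made-up placeholders.
    cm = np.array([[40,  6,  4],
                   [ 7, 25,  3],
                   [ 4,  3,  8]])

    accuracy = np.trace(cm) / cm.sum()  # correct predictions / all predictions
    print(accuracy)  # 0.73

If you have the raw predictions rather than the matrix, sklearn.metrics.accuracy_score(y_test, y_pred) returns the same value.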

How do I use it?

Accuracy is a simple metric to understand, and that makes it a good starting point for explaining a model to others, especially users of the model who are not data scientists. No background in statistics is required to understand the model’s accuracy. Having a confusion matrix available provides further insight into the model’s performance.

But there are several challenges associated with accuracy that I will elaborate on in the discussion below. The usefulness of the metric depends on the context of the problem.

Discussion

A frequent question when considering model performance is “how good is the model?”. Here’s an example of why the answer to that question may not be straightforward. Consider the confusion matrix below (model 2).

[Confusion matrix for model 2]

A quick calculation of the accuracy for this model results in a value of (70 + 10 + 3) / 100, or 0.83. On the surface, this seems like a better result than the model above (model 1) with its accuracy of 0.73. But is it?

In order to answer that question, you need to start with the accuracy of a naïve guess. For a classification problem, a simple guess would be to always predict the most common class. For model 1, that guess would be “On time” and would produce an accuracy of 0.50. The guess for model 2 would also be “On time” and would produce an accuracy of 0.80.

Coming back to the question of which model is better… model 1 improves on the naïve guess by 0.73 – 0.50 = 0.23, whereas model 2 improves on the naïve guess by 0.83 – 0.80 = 0.03. So model 1 is the better model even though it has lower accuracy. The point is that you need more context than the accuracy value alone to assess model quality.
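If you work in scikit-learn, DummyClassifier(strategy="most_frequent") will generate this kind of majority-class baseline for you; the sketch below simply redoes the arithmetic from the two confusion matrices:

    # Improvement over the naive (majority-class) guess for the two models,
    # using the accuracy and naive-guess numbers worked out above.
    models = {
        "Model 1": {"accuracy": 0.73, "naive": 0.50},
        "Model 2": {"accuracy": 0.83, "naive": 0.80},
    }

    for name, m in models.items():
        improvement = m["accuracy"] - m["naive"]
        print(f"{name}: accuracy={m['accuracy']:.2f}, naive={m['naive']:.2f}, "
              f"improvement={improvement:.2f}")
    # Model 1: accuracy=0.73, naive=0.50, improvement=0.23
    # Model 2: accuracy=0.83, naive=0.80, improvement=0.03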

There’s another aspect to this that is worth mentioning. Let’s say you have a medical test that is used to detect a disease in a patient. It’s a binary classification problem where a positive result indicates the patient has the disease. In this scenario, you have to think about the impact of the following:

  • false positives where the test says the patient has the disease but actually doesn’t
  • false negatives where the test says the patient doesn’t have the disease but actually does

Obviously, neither of these error types is good, but which is worse? Again, it depends. If it’s a life-threatening disease that requires fast treatment, minimizing the false negatives (hopefully followed by additional tests!) would take priority. Other, less critical scenarios might direct the model creators to minimize false positives instead. The bottom line is that you need more than an accuracy metric to decide on model quality.
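As a small illustration, here’s one way to pull the false positive and false negative counts out of a binary confusion matrix with scikit-learn. The test results are made up for the example:

    from sklearn.metrics import confusion_matrix

    # Hypothetical results for the medical-test scenario:
    # 1 = "has the disease", 0 = "does not have the disease".
    y_actual    = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
    y_predicted = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

    # For labels (0, 1), confusion_matrix orders the counts as [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
    print(f"false positives: {fp}, false negatives: {fn}")  # 1 and 1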

Recommendations for using accuracy

As noted above, accuracy is a simple metric, and that makes it an important tool for communicating with domain experts who aren’t familiar with statistics. That said, providing accuracy along with additional context is critical to making it useful.

For the payment prediction scenario modeled above, I’ve settled on a target for the ML model that factors in different payment behaviors. The target is that the model should improve upon a naïve guess by reducing the wrong answers by at least 50%. Put another way, I want a target accuracy that splits the difference between the naïve guess accuracy and 100%.
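In other words, the target works out to the midpoint between the naïve guess accuracy and 1.0. A tiny sketch of that rule:

    # Target accuracy that splits the difference between the naive guess and 1.0,
    # i.e. cuts the naive model's wrong answers in half.
    def target_accuracy(naive_accuracy: float) -> float:
        return (naive_accuracy + 1.0) / 2

    print(target_accuracy(0.50))  # 0.75 for model 1
    print(target_accuracy(0.80))  # 0.9 for model 2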

I’ve summarized this for the confusion matrices in this post below:

Model     Naïve guess   Target   Model accuracy   Meets goal?
Model 1   0.50          0.75     0.73             Almost… this model improves upon the guess significantly
Model 2   0.80          0.90     0.83             No… we need to do better than this.

This is a pretty long post for a seemingly simple metric, proving my point that it’s not as simple as it looks. Use classification accuracy, but make sure you provide additional context when you do so!

Picture details:  Superior shoreline, 8/6/2019, Canon PowerShot SD4000 IS, f/4, 1/320 s, ISO-1600, –1 step