Tools of the trade: Classification ML F1 accuracy

[Photo: Maplewood State Park]

This is the fourth post in a series of blog posts using a theme of “Tools of the trade”. The series explains software tools, statistical concepts, data science techniques, or related items. In all cases, the topic will contribute to accomplishing data science tasks. The target audience for the posts is engineers, analysts, and managers who want to build their knowledge and skills in data science.

This post follows up on the earlier posts on the confusion matrix and accuracy by discussing a more advanced measure of classification ML performance: F1 accuracy.

What is it?

Before defining F1 accuracy, two other metrics need an introduction: precision and recall. Precision tells us what fraction of the predictions labeled positive are actually positive; it is also known as positive predictive value. Recall tells us what fraction of the actual positive cases were predicted correctly; it is also known as sensitivity.

[Figure: confusion matrix showing the True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) cells]

Referring to the confusion matrix above:

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)

The F1 measure combines precision and recall. The result is the harmonic mean of the two values and is calculated as follows:

  • F1 = 2 * (Precision * Recall) / (Precision + Recall)

Let’s look at a concrete example. Back in https://lakedatainsights.com/2019/12/01/tools-of-the-trade-confusion-matrix/, I used the following confusion matrix.

[Figure: Dog vs. Cat confusion matrix from the confusion matrix post]

Using Dog as the positive answer yields:

  • Precision = 24 / (24+2) = 0.9231
  • Recall = 24/(24+6) = 0.8
  • F1 = 2*(0.9231*0.8) / (0.9231+0.8) = 0.8572

As you can see, the F1 value is between the values for precision and recall.
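
If you want to check the arithmetic yourself, here is a minimal Python sketch using the counts from the confusion matrix above (plain division, no libraries required):

```python
# Counts for the "Dog" (positive) class, taken from the confusion matrix above
tp = 24  # dogs correctly predicted as dogs
fp = 2   # cats incorrectly predicted as dogs
fn = 6   # dogs incorrectly predicted as cats

precision = tp / (tp + fp)                            # 24 / 26 ≈ 0.9231
recall = tp / (tp + fn)                               # 24 / 30 = 0.8
f1 = 2 * (precision * recall) / (precision + recall)  # ≈ 0.857

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```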

How do I use it?

While not as easy to understand as plain accuracy, F1 accuracy adds nuance to the basic accuracy number. It can also help with unbalanced datasets, as we will see in the following discussion.

Discussion

My last post, https://lakedatainsights.com/2019/12/22/tools-of-the-trade-classification-ml-accuracy/, compared the following two confusion matrices. Even though the first model had a lower accuracy, it was deemed the more useful model since it improved more over the default ‘guess’ of an On time payment.

[Figures: the Model 1 and Model 2 confusion matrices from the accuracy post, with the payment states On time, Late, and Very late]

Let’s see how these two models compare on the F1 score, since it factors in precision and recall for each state. The F1 macro calculation then averages the F1 score across the states to determine an overall F1 score. There are other F1 variants, but I’m most interested in the macro version given its equal consideration of all three states.

To simplify the calculations, I built out sample arrays to match the actual and predicted values and used sklearn’s metrics library in Python to calculate the results. Here they are, with a sketch of the code after the table:

Model      Naive guess   Accuracy   F1 Macro
Model 1    0.50          0.73       0.67
Model 2    0.80          0.83       0.66
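
Here is a minimal sketch of that approach. The y_true and y_pred arrays below are short placeholders standing in for a model’s actual and predicted labels, not the real data behind the table, so the printed numbers will differ:

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder labels: 1 = On time, 2 = Late, 3 = Very late.
# Substitute a model's actual and predicted labels to reproduce the table.
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 1, 3, 3, 2]

accuracy = accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average="macro")

print(f"Accuracy: {accuracy:.2f}")
print(f"F1 Macro: {f1_macro:.2f}")
```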

For a bit more detail on how this works, here’s the sklearn.metrics classification_report for Model 1. The states On time, Late, and Very late are represented by the rows labeled 1, 2, and 3. The macro average is simply the average of the f1-score column.

   precision  recall  f1-score
1    0.83      0.80     0.82
2    0.68      0.71     0.69
3    0.50      0.50     0.50

macro avg f1-score: 0.67
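
For reference, the report itself comes from a single sklearn call; a sketch using the same kind of placeholder arrays looks like this:

```python
from sklearn.metrics import classification_report

# Placeholder labels again: swap in Model 1's actual and predicted labels
# to reproduce the report shown above.
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 1, 3, 3, 2]

# Prints precision, recall, and f1-score per class plus the macro average row.
print(classification_report(y_true, y_pred))
```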

Recommendations for using F1 accuracy

As the results above show, the two models have nearly identical F1 Macro scores. In this case, as in many others, F1 accuracy provides a better indicator of a model’s capability than plain accuracy. As with accuracy, understanding what matters most in the model is critical for interpreting the results.

References

https://en.wikipedia.org/wiki/Precision_and_recall

https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Picture details:  Maplewood State Park, 2/17/2020, iPhone 7, f/1.8, 1/2283 s, ISO-20