This is the fourth post in a series of blog posts with the theme “Tools of the trade”. The series explains software tools, statistical concepts, data science techniques, and related items. In all cases, the topic will contribute to accomplishing data science tasks. The target audience for the posts is engineers, analysts, and managers who want to build their knowledge and skills in data science.
This post follows up on the previous posts on the confusion matrix and accuracy by discussing a more advanced measure of classification ML performance: F1 accuracy, also known as the F1 score.
What is it?
Before defining F1 accuracy, there are two other metrics that need introduction: precision and recall. Precision is the fraction of the predictions labeled positive that are correctly assigned; it is also known as positive predictive value. Recall is the fraction of the actual positive cases that were predicted correctly; it is also known as sensitivity.
Referring to the confusion matrix above:
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
The F1 measure combines precision and recall. The result is the harmonic mean of the two values and is calculated as follows:
- F1 = 2 * (Precision * Recall) / (Precision + Recall)
Let’s look at a concrete example. Back in https://lakedatainsights.com/2019/12/01/tools-of-the-trade-confusion-matrix/, I used the following example. With Dog as the positive answer, that confusion matrix had 24 true positives, 2 false positives, and 6 false negatives, which yields:
- Precision = 24 / (24+2) = 0.9231
- Recall = 24/(24+6) = 0.8
- F1 = 2*(0.9231*0.8) / (0.9231+0.8) = 0.8572
As you can see, the F1 value is between the values for precision and recall.
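If you want to check the arithmetic yourself, here is a quick Python sketch that plugs the counts from the Dog example (TP = 24, FP = 2, FN = 6) into the formulas above:

```python
# Counts from the Dog-as-positive example above
tp, fp, fn = 24, 2, 6

precision = tp / (tp + fp)   # 24 / 26 ≈ 0.923
recall = tp / (tp + fn)      # 24 / 30 = 0.800
f1 = 2 * (precision * recall) / (precision + recall)   # ≈ 0.857

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```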
How do I use it?
While not as intuitive as plain accuracy, F1 accuracy adds some nuance to the basic accuracy number. It can also help with imbalanced datasets, as we will see in the following discussion.
Discussion
My last post, https://lakedatainsights.com/2019/12/22/tools-of-the-trade-classification-ml-accuracy/, compared the following two confusion matrices. Even though the first model had a lower accuracy, it was deemed the more useful model since it showed more improvement over the naive default ‘guess’ of an On time payment.
Let’s see how these two models compare when using the F1 score, since the F1 score factors in precision and recall for each state. The F1 macro calculation then averages the per-state F1 scores to determine an overall F1 score. There are other F1 variants (micro and weighted averages), but I’m most interested in the macro version because it gives all three states equal consideration.
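As a rough sketch of what that averaging looks like in code, sklearn’s f1_score exposes the variants through its average parameter. The label arrays below are made-up placeholders for the three payment states, not the data behind the two models:

```python
from sklearn.metrics import f1_score

# Made-up three-state example: 1 = On time, 2 = Late, 3 = Very late
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 1, 2, 2, 3, 3, 3, 1]

per_state = f1_score(y_true, y_pred, average=None)       # one F1 per state
macro = f1_score(y_true, y_pred, average='macro')        # unweighted mean of per_state
weighted = f1_score(y_true, y_pred, average='weighted')  # mean weighted by state counts

print(per_state, macro, weighted)
assert abs(macro - per_state.mean()) < 1e-9  # macro is just the plain average
```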
To simplify the calculations, I built out sample arrays to match the actual and predicted labels and used sklearn’s metrics library in Python to calculate the values. Here’s the result:
| Model | Naive guess | Accuracy | F1 Macro |
|---------|------|------|------|
| Model 1 | 0.50 | 0.73 | 0.67 |
| Model 2 | 0.80 | 0.83 | 0.66 |
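One way to build out sample arrays that match a set of actual and predicted labels is to expand each cell of a confusion matrix into that many label pairs. The matrix below is a placeholder rather than Model 1 or Model 2 from the earlier post, but the mechanics are the same:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

states = [1, 2, 3]              # 1 = On time, 2 = Late, 3 = Very late
cm = np.array([[40,  7,  3],    # placeholder matrix: rows = actual, columns = predicted
               [ 6, 20,  4],
               [ 2,  3, 15]])

y_true, y_pred = [], []
for i, actual in enumerate(states):
    for j, predicted in enumerate(states):
        count = int(cm[i, j])
        y_true.extend([actual] * count)
        y_pred.extend([predicted] * count)

print("Accuracy:", accuracy_score(y_true, y_pred))              # diagonal / total
print("F1 macro:", f1_score(y_true, y_pred, average='macro'))   # mean of per-state F1
```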
For a bit more detail on how this works, here’s the sklearn.metrics classification_report for model 1. The states of On time, Late, and Very late are represented by the rows labeled 1, 2, and 3. The macro average is simply the average of the f1-score column.
| | precision | recall | f1-score |
|---|------|------|------|
| 1 | 0.83 | 0.80 | 0.82 |
| 2 | 0.68 | 0.71 | 0.69 |
| 3 | 0.50 | 0.50 | 0.50 |
| macro avg | | | 0.67 |
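The report itself comes straight from sklearn. Here is a minimal sketch of the call, again with placeholder labels rather than the model’s real output:

```python
from sklearn.metrics import classification_report

# Placeholder labels for the three payment states
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 1, 2, 2, 3, 3, 3, 1]

# Prints precision, recall, f1-score, and support for each state,
# plus the macro average (the plain mean of the f1-score column).
print(classification_report(y_true, y_pred))
```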
Recommendations for using F1 accuracy
As the results above show, the two models have nearly identical F1 macro scores. In this case, and in many others, F1 accuracy provides a better indicator of a model’s capability than plain accuracy. As with accuracy, understanding what is most important for the model to get right is critical for interpreting the results.
References
https://en.wikipedia.org/wiki/Precision_and_recall
https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
Picture details: Maplewood State Park, 2/17/2020, iPhone 7, f/1.8, 1/2283 s, ISO-20