Three attributes of a quality ML solution

2019.10.08

We’ve been working on a machine learning (ML) product capability over the last several weeks. I’ve been framing our discussion about what a quality solution will be around three attributes: accuracy, robustness, and trustworthiness.

Accuracy

Of these three attributes, the accuracy of the predictions from an ML model is typically brought up most often. This is not surprising since making good predictions that drive better decision making is the goal of most ML models.

Determining model accuracy is more nuanced than it might seem on the surface. You have to decide what metric you’re going to use for the measurement and what counts as ‘good enough’ accuracy to drive better decisions.

Take a multi-class classification problem where the goal is to predict one of several classes. The most common and easiest-to-understand metric is called ‘accuracy’: the fraction of predictions on a test dataset that the model got correct. This is typically visualized in a confusion matrix, where the diagonal values are the cases the model predicted correctly.
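As a rough sketch (with made-up labels for a three-class problem), both accuracy and the confusion matrix can be computed in a few lines:

```python
from collections import Counter

# Hypothetical true labels and model predictions for a 3-class problem
# (think SUV 1, SUV 2, SUV 3) -- invented data for illustration only.
y_true = [1, 1, 2, 2, 2, 3, 3, 1, 2, 3]
y_pred = [1, 2, 2, 2, 3, 3, 3, 1, 2, 1]

# Accuracy: fraction of predictions the model got right.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix: rows = actual class, columns = predicted class.
# Correct predictions land on the diagonal.
counts = Counter(zip(y_true, y_pred))
classes = sorted(set(y_true))
matrix = [[counts[(t, p)] for p in classes] for t in classes]

print(accuracy)   # 0.7 -- 7 of 10 predictions fall on the diagonal
for row in matrix:
    print(row)
```

Libraries like scikit-learn provide the same calculations (accuracy_score, confusion_matrix), but the arithmetic is this simple.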

For example, let’s say we’re creating an ML model for car dealers and the goal is to predict which of three SUV models for a given brand a potential customer might buy. We recently bought a new Subaru and the three possible choices were the Forester, Outback, and Ascent. If we walked into a Honda dealer, the choices would be CRV, Passport, and Pilot.

For Brand A, the ML model has 75% accuracy, whereas the model has 90% accuracy for Brand B. Is the model performing better for Brand B? A helpful way to evaluate this is to consider what accuracy a naïve guess might deliver. Usually that guess involves finding the most common result class and always predicting that value. Let’s say the historical percentage of customers buying a particular SUV at the two brands is as shown in the table below.

          SUV 1   SUV 2   SUV 3   Naive guess   ML model accuracy   Improvement
Brand A   50%     25%     25%     50%           75%                 25%
Brand B   10%     75%     15%     75%           90%                 15%

In this case, the ML model for Brand A is actually showing much better improvement than for Brand B, even though Brand B has higher accuracy.
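The comparison above is easy to mechanize. The sketch below plugs in the shares from the table and reproduces the Improvement column (the brand names and numbers are from the example, not real data):

```python
# Naive-baseline comparison using the historical shares from the table.
# The naive guess always predicts the most common SUV for each brand.
brands = {
    "Brand A": {"shares": [0.50, 0.25, 0.25], "model_accuracy": 0.75},
    "Brand B": {"shares": [0.10, 0.75, 0.15], "model_accuracy": 0.90},
}

for brand, info in brands.items():
    naive = max(info["shares"])  # accuracy of always picking the majority class
    improvement = info["model_accuracy"] - naive
    print(f"{brand}: naive={naive:.0%}, "
          f"model={info['model_accuracy']:.0%}, improvement={improvement:.0%}")
```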

Using the accuracy metric for a classification problem is straightforward and easy to explain, but it is not always the best choice. Considering the impact of a false positive or a false negative may lead you to use precision or recall as the preferred metric; the F1 score incorporates both. Machine Learning Mastery provides a good overview of these metrics at https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/.
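For a binary case (say, predicting whether a customer buys or not, with invented labels), the three metrics reduce to counting true positives, false positives, and false negatives:

```python
# Hypothetical binary labels: 1 = customer bought, 0 = customer did not.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # of predicted buyers, how many actually bought
recall = tp / (tp + fn)     # of actual buyers, how many the model caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

Whether a false positive (wasted sales effort) or a false negative (missed sale) is more costly determines which of precision or recall you optimize.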

A metric that I like for comparing classification models is log loss. It takes into account the underlying probabilities for each class and is frequently used at Kaggle.com. The Kaggle NCAA March Madness competition is one example: your score depends on the probability your model assigned to the team that won each game. If your model gave a low probability to the team that ultimately won, you are ‘punished’ with a high log loss value. This hurts your rank, as the lowest overall log loss wins the competition.
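Log loss averages the negative log of the probability the model assigned to the class that actually occurred. A small sketch with two invented models scoring the same three games shows why confident, correct probabilities win:

```python
import math

def log_loss(y_true, y_prob):
    """Mean of -log(probability assigned to the class that occurred)."""
    return sum(-math.log(p[t]) for t, p in zip(y_true, y_prob)) / len(y_true)

# Index of the winning team for three hypothetical games,
# and each model's predicted probabilities per team.
y_true = [0, 1, 0]
confident = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2]]  # winner usually gets high prob
hedged    = [[0.6, 0.4], [0.5, 0.5], [0.6, 0.4]]  # near coin-flip every game

print(log_loss(y_true, confident))  # lower (better) score
print(log_loss(y_true, hedged))     # higher (worse) score
```

Note the asymmetry: a probability near zero for the eventual winner produces a very large -log term, which is exactly the ‘punishment’ described above.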

Robustness

The data preparation phase is the primary place where model robustness needs to be considered. Effective handling of missing values, dealing with potential features that are not actually predictive, and managing data anomalies are required for a robust solution.

This is especially true when building a general purpose product capability as in my SUV model above. Even if the data sources were the same, the potential features for the model may not be used the same by each brand and the importance of different features will be different. Building a robust modeling capability in this scenario is much more challenging than building a model for a specific brand and scenario.

Another consideration in the robustness category is ensuring the model is not over-trained. An over-trained model appears to have high accuracy but doesn’t do well with new inputs for prediction; it doesn’t generalize to this new data. Using techniques that minimize the variance error and ensuring accuracy is measured with a true hold-out dataset are a couple of mitigations. Read more about the bias-variance trade-off at https://en.wikipedia.org/wiki/Bias–variance_tradeoff.
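A toy illustration of why the hold-out set matters (synthetic data, invented for this sketch): a model that simply memorizes its training examples looks perfect on the data it was trained on, but a hold-out set exposes that it generalizes worse than a much simpler rule.

```python
import random

random.seed(0)

def make_data(n):
    """Synthetic data: class is True when x > 0.5, with 20% label noise."""
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [(x > 0.5) != (random.random() < 0.2) for x in xs]
    return list(zip(xs, ys))

train, test = make_data(100), make_data(100)

def memorizer(x):
    """'Over-trained' model: recall the label of the nearest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def simple(x):
    """Simple model: just the underlying rule, ignoring the noise."""
    return x > 0.5

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memorizer, train))  # 1.0 -- looks perfect on training data
print(accuracy(memorizer, test))   # noticeably lower on the hold-out set
print(accuracy(simple, test))      # the simpler rule holds up better
```

The memorizer has fit the noise (high variance); measuring only on the training set would hide that completely.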

Trustworthiness

Trustworthiness is ultimately the most important attribute of these three. A model must be accurate and robust to be trustworthy, but those attributes are insufficient to earn the full trust of the ML model user. User confidence is gained by understanding the model’s behavior. This is where model explainability comes into play (see Explaining Explainable AI).

Let’s say a young couple walks into a dealership looking for an SUV. The salesman looks up their information and gets a recommendation to sell them the largest SUV, as they have an 80% likelihood of going that route. The salesman will trust the model further if he gets the additional insight that the most important factors for the model are that the couple has two children… and a dog that is over 70 pounds. This will build trust much faster than a black box model. Much of this earned trust has to do with the overall user experience built into the app that uses the ML capability.
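One simple way an app can surface that kind of insight is to report, alongside the prediction, which features contributed most to it. The sketch below uses an invented linear scoring model with made-up feature names and weights, purely to show the shape of the idea; real explainability tooling (feature importances, SHAP, etc.) follows the same pattern.

```python
# Hypothetical weights for a transparent scoring model -- the feature
# names and values here are invented for illustration.
weights = {
    "num_children": 0.30,
    "has_large_dog": 0.35,
    "commute_miles": 0.01,
}

def explain(customer):
    """Return the score plus the features ranked by their contribution."""
    contributions = {f: w * customer.get(f, 0) for f, w in weights.items()}
    score = sum(contributions.values())
    ranked = sorted(contributions, key=lambda f: abs(contributions[f]),
                    reverse=True)
    return score, ranked

couple = {"num_children": 2, "has_large_dog": 1, "commute_miles": 5}
score, factors = explain(couple)
print(factors[:2])  # the top factors the salesperson would be shown
```

Showing those top factors next to the “80% likelihood” number is what turns a black-box score into something the salesperson can sanity-check against the customer standing in front of them.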

Enabling data-informed experts requires ML solutions that are accurate, robust, and trustworthy. Hopefully this post explains why that is the case. Ideas on how to achieve these three attributes will have to be saved for future posts.