Tools of the trade: AutoML

[Image: IMG_2535]

This is the 11th post in a blog series themed “Tools of the trade”. The series covers software tools, statistical concepts, data science techniques, and related items. In all cases, the topic contributes to accomplishing data science tasks. The target audience is engineers, analysts, and managers who want to build their knowledge and skills in data science, particularly those in the Microsoft Dynamics ecosystem.

This post focuses on automated machine learning, or AutoML. The specific tool is Azure AutoML.

What is it?

The introduction in the Azure docs defines AutoML as follows (https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml):

“Automated machine learning, also referred to as automated ML or AutoML, is the process of automating the time consuming, iterative tasks of machine learning model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity all while sustaining model quality.”

The scope of AutoML is an interesting topic. Wikipedia asserts that “AutoML covers the complete pipeline from the raw dataset to the deployable machine learning model.” (Automated machine learning – Wikipedia). While there are opportunities for automation across the complete pipeline, the best opportunity is in the modeling phase of the data science lifecycle (see CRISP-DM). In this phase, the goal is to find the algorithm and set of hyperparameters that deliver the best ML model performance. Once you have identified the target metric to minimize or maximize, AutoML becomes an optimization problem: a search for the best algorithm + hyperparameter combination.
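
To make the "search for the best algorithm + hyperparameter combination" idea concrete, here is a minimal sketch in plain Python. It is not how Azure AutoML works internally; it is the kind of hand-coded exhaustive loop that AutoML replaces with a much smarter search. The two toy algorithms, the synthetic data, the hyperparameter grids, and all function names are assumptions made up for illustration.

```python
import itertools
import math
import random
from statistics import mean


def fit_knn(train, k):
    """Toy algorithm 1: predict the average target of the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
        return mean(y for _, y in nearest)
    return predict


def fit_ridge(train, alpha):
    """Toy algorithm 2: one-feature ridge regression with an unpenalized intercept."""
    x_bar = mean(x for x, _ in train)
    y_bar = mean(y for _, y in train)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in train)
    s_xx = sum((x - x_bar) ** 2 for x, _ in train)
    slope = s_xy / (s_xx + alpha)
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def rmse(model, data):
    """The metric to minimize: root mean squared error on held-out data."""
    return math.sqrt(mean((model(x) - y) ** 2 for x, y in data))


# Hypothetical search space: each algorithm with a grid of hyperparameter values.
SEARCH_SPACE = {
    "knn": (fit_knn, {"k": [1, 3, 5, 9]}),
    "ridge": (fit_ridge, {"alpha": [0.0, 0.1, 1.0, 10.0]}),
}


def search(train, valid, space):
    """Try every algorithm + hyperparameter combination; return the best and all results."""
    results = []
    for name, (fit, grid) in space.items():
        keys = list(grid)
        for values in itertools.product(*(grid[key] for key in keys)):
            params = dict(zip(keys, values))
            score = rmse(fit(train, **params), valid)
            results.append({"algorithm": name, "params": params, "score": score})
    best = min(results, key=lambda result: result["score"])
    return best, results


# Synthetic data (an assumption for the example): a noisy straight line.
rng = random.Random(0)
points = [(i / 10, 2 * (i / 10) + 1 + rng.gauss(0, 0.5)) for i in range(100)]
rng.shuffle(points)
train, valid = points[:70], points[70:]

best, results = search(train, valid, SEARCH_SPACE)
print(best)
```

Even in this toy version, the cost grows with every algorithm and every hyperparameter value added to the grid, which is the part of the work that is worth handing to a tool.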

Please see the paper in the Additional References below for a detailed analysis of AutoML capabilities.

How do I use it?

For Dynamics 365 Finance Insights, we don’t provide pre-trained ML models. Instead, we provide a template that each customer uses to train an ML model on their own data. For this to work out of the box, without a data scientist to assess the results, we depend on Azure AutoML to find the optimal algorithm + hyperparameters given the problem template and the customer’s data.

The other place that I’ve used AutoML is during the exploration phase of a new data science project. My initial modeling pass will most likely use local libraries in R or Python to understand feasibility. After determining that ML is practical for the problem, I use AutoML to broaden the search space for the optimal algorithm. For an example of this, please see the results of an AutoML run on a time series problem with about 400 unique time series.

[Figure: results of an AutoML run on a time series problem with about 400 unique time series]

Discussion

I have to admit to being an AutoML skeptic when I first heard about it (note – this comment is targeting all AutoML implementations, not Azure specifically). There is a bit of snake oil magic to some of the claims regarding how AutoML will make machine learning available to anyone and everyone. Given data science’s relative youth, those claims don’t resonate with me.

I’m still skeptical about AutoML applications outside of the modeling phase of the lifecycle. For example, data preparation involves discovering the right data, merging datasets appropriately, making informed decisions about missing data, and applying domain knowledge to feature engineering. There are some basic things that software can do for the data scientist in the data preparation phase for well-known constructs. For example, expanding dates into features such as month, day, year, day of month, day of week, weekend / weekday, holiday, etc., is a great application. But there are many other scenarios where the expertise of a data scientist will deliver the best results. Since good data is an essential part of a successful ML solution, I’m hesitant to leave this to a machine.
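
The date expansion mentioned above is simple enough to sketch with the Python standard library. This is only an illustration of the idea, not what any AutoML product does internally; the function name and the holiday list are hypothetical, and a real holiday calendar depends on the country and the business.

```python
from datetime import date

# Hypothetical holiday calendar, for illustration only.
HOLIDAYS = {date(2019, 7, 4), date(2019, 12, 25)}


def expand_date(d, holidays=HOLIDAYS):
    """Expand one date into simple calendar features."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_month": d.day,
        "day_of_week": d.weekday(),      # Monday = 0 ... Sunday = 6
        "is_weekend": d.weekday() >= 5,  # Saturday or Sunday
        "is_holiday": d in holidays,
    }


features = expand_date(date(2019, 8, 6))
print(features)
```

The mechanical part is easy to automate. Deciding which calendar matters, or whether the day before a holiday behaves differently, is where domain knowledge comes in.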

I’m also skeptical about AutoML making ML development available to the masses. Just take a look at the available parameters of Azure’s AutoMLConfig class (azureml.train.automl.automlconfig.AutoMLConfig class – Azure Machine Learning Python | Microsoft Docs); there are dozens! The Azure team does a good job of providing samples and tutorials to make this manageable, but it’s not for the casual developer, much less a citizen data scientist. There’s a better chance of success with the Azure ML GUI sitting on top of AutoML, but you still need a basic understanding of the process and of how to interpret the results to ensure you have a solution that is technically correct, unbiased, explainable, deployable, and more.
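
To give a feel for the configuration surface, here is a small sketch of a handful of settings. The parameter names are, to the best of my knowledge, ones that the AutoMLConfig class in the v1 Python SDK accepts, but please verify them against the class documentation for your SDK version. The values and the label column name are hypothetical. The call to AutoMLConfig is left as a comment because it needs the azureml SDK, a registered dataset, and a compute target, none of which are assumed here.

```python
# A handful of the dozens of settings (values are hypothetical, for illustration).
automl_settings = {
    "task": "regression",
    "primary_metric": "normalized_root_mean_squared_error",
    "label_column_name": "target",   # hypothetical column name
    "n_cross_validations": 5,
    "experiment_timeout_hours": 1,
    "enable_early_stopping": True,
}

# With the SDK installed, these would typically be passed along the lines of:
#   from azureml.train.automl import AutoMLConfig
#   automl_config = AutoMLConfig(training_data=train_dataset,
#                                compute_target=compute_target,
#                                **automl_settings)
# where train_dataset and compute_target are objects you create in your workspace.

for name, value in automl_settings.items():
    print(f"{name} = {value!r}")
```

Each of these choices (the metric, the validation scheme, the time budget) affects the result, which is why some understanding of the process is still required.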

Despite some skepticism regarding how AutoML is used and by whom, it should be part of a data scientist’s toolbox. It’s an important part of our “templated ML” Dynamics 365 Finance Insights solution. AutoML is a time saver and is better at searching for the optimal algorithm + hyperparameters than a for loop that I might hand-code. And it gets better with time. For example, several of the time series algorithms shown above didn’t exist in Azure AutoML a month ago.

As with any tool, recognize where AutoML can be best applied, learn how to use it, and reap the productivity benefits.

Additional References

https://arxiv.org/pdf/1810.13306.pdf – “Taking the Human out of Learning Applications:  A Survey on Automated Machine Learning”

Picture details:  Lake Superior, 8/6/2019, Canon PowerShot SD4000 IS, f5.6, 1/800 s, ISO-1600, –1 step