Essential ingredients for a data science project

Frosty

This week marks the start of another year of Microsoft Fargo participating in the NDSU Computer Science Capstone Program as an industry partner. In this class, seniors in the computer science program work in teams of four with an industry partner on a project. I’ve led this engagement for Microsoft for many years.

In recent years, my goal has been to expose the students to applying Microsoft technologies in data science projects. For example, two years ago the students built an employee retention prediction capability using Azure ML and two synthetic datasets. They also did some visualization using Power BI.

We’re trying something different this year by partnering with the Agricultural Engineering department on a project. The Computer Science capstone students will be working with data from a ‘smart farm’ that NDSU owns, with the objective of building visualizations and predictive yield models. This aligns well with a new venture in the Agricultural Engineering department, which is just kicking off a Precision Agriculture major. Microsoft is also working on a broader partnership with this program, and we’re excited to be a part of it.

Bringing together these different resources (Computer Science department, Ag Engineering department, Microsoft) reinforces the need for what I see as the three essential ingredients for a successful data science project:

The business problem

Data science projects start with the business problem. To frame the problem properly, you need domain knowledge and a ‘data mindset’.

This ingredient ties back to my post on the data informed expert. In the Capstone project, a farmer who is either ignorant or dismissive of data would not be able to properly frame the problem. Nor would a data scientist who knows little about agriculture do a good job with this task. What is needed is someone versed in agriculture AND the possibilities of the data available. This is where the experts in the Agricultural Engineering department will play a critical role in the project.

The data

A seemingly obvious aspect of a data project is the data. We need enough relevant data that can be used cohesively to address the business problem.

This ingredient ties back to my post on BI before AI. In nearly all projects, the data will need to be wrangled in some manner to be useful. In the Capstone project, the Agricultural Engineering department owns a fair amount of relevant data from the ‘smart farm’ in 2018: yield, planting densities, soil measurements, aerial photography, etc. A significant challenge will be using this information in a spatially cohesive manner, since each data source is recorded at its own locations and resolution. It’s not unusual for data wrangling to take the majority of a project’s time.
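
To make ‘spatially cohesive’ a bit more concrete, here is a minimal sketch of one common approach: snap every layer of point measurements onto the same grid, average the values within each cell, and keep only the cells that every layer covers. This is an illustration, not the method the Capstone team will use. The function names, the 10-meter cell size, and the sample numbers are all made up, and it assumes the coordinates are already in a shared projected system measured in meters.

```python
from collections import defaultdict
from statistics import mean

CELL_SIZE_M = 10.0  # assumed grid resolution in meters (hypothetical)


def cell_of(x, y, cell_size=CELL_SIZE_M):
    """Map a coordinate (in meters) to the index of the grid cell containing it."""
    return (int(x // cell_size), int(y // cell_size))


def grid_average(points, cell_size=CELL_SIZE_M):
    """Average the values of (x, y, value) points that fall in the same cell."""
    buckets = defaultdict(list)
    for x, y, value in points:
        buckets[cell_of(x, y, cell_size)].append(value)
    return {cell: mean(values) for cell, values in buckets.items()}


def join_layers(**layers):
    """Grid each named layer, then keep only cells present in every layer.

    Returns {cell: {layer_name: averaged_value}}. Expects at least one layer.
    """
    gridded = {name: grid_average(pts) for name, pts in layers.items()}
    shared = set.intersection(*(set(g) for g in gridded.values()))
    return {
        cell: {name: g[cell] for name, g in gridded.items()}
        for cell in sorted(shared)
    }


# Made-up sample data: (x, y, value) tuples, not real farm measurements.
yield_points = [(1.0, 2.0, 50.0), (3.0, 4.0, 54.0), (12.0, 2.0, 60.0)]
soil_points = [(5.0, 5.0, 6.5), (25.0, 5.0, 7.0)]

table = join_layers(yield_value=yield_points, soil_ph=soil_points)
print(table)
```

With this sample data, only one cell has both a yield reading and a soil reading, so the joined table has a single row. That shrinkage is typical: the usable dataset is limited by the sparsest layer, which is part of why wrangling takes so long.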

The expertise

The last ingredient is the data science expertise to apply the data to solve the business problem in a reliable, robust, and ethical manner.

While this ingredient is the focus of data science education, it’s only one of the three essential ingredients. For the Capstone project, this is where Microsoft tooling, the students, and I come into play. The students will need to choose the right tools to do the job and apply them appropriately.

Bringing all three of these ingredients together is harder than it might seem. For example, the employee retention project had a good problem – how to retain top performers and understand why people leave. And we effectively applied Azure ML Studio and Power BI to the problem. However, the synthetic data produced some oddities when we compared different ML algorithms.

Problem + Data + Expertise = Data Science Success… we’ll see how this formula plays out over the next few months!

Picture details:  Frosty Morning, 1/15/2019, iPhone 7 Plus, f/2.8, 1/120, ISO-32