Data products

Data science is often described as the intersection of domain knowledge, statistics, and computer science. When thinking about computer science, most people focus on the programming element. Another very important aspect is the software engineering discipline that has developed over multiple decades. Mature data organizations should be borrowing several of its processes, practices, and tools.

One way to bring software engineering into data science is to focus on building data products. Here’s my definition:

A data product is a data-driven capability that enables targeted users to make better decisions on an ongoing basis.

One example of a data product is a report that provides status on a key metric or information that leads to an action. Another example is a machine learning capability that either directly takes an action or provides more information to make a decision. Using this definition, a data product could target internal users (key metrics, for example) or external users (features in a software product).

Thinking about the development and maintenance of data products as defined above helps to bring in several aspects of software engineering:

Focusing on the user and the decisions enabled by the product drives the right outcome. It’s easy to get lost in the analysis phase of a data project and forget what is needed. Different users have different needs. While the same information might be useful to both an executive and an engineer, their decisions are different, and that likely requires a different view of the information… and potentially a completely new data product. Software engineering tools such as personas, use cases, scenarios, and lean thinking apply here.
Products have lifecycles – development, introduction, maintenance, and retirement. In contrast, some data deliverables like reports seem to appear out of nowhere. No one knows who is responsible for them, so they are not maintained and no one knows for sure if and when they can be retired.
Lastly, a product focus brings in several practices of product development that are important for a mature data organization. These practices address compliance, source code control, configuration management, DevOps, documentation, and more. While there is starting to be more focus here, it still feels pretty early in my experience. For example, while it might have been up for debate 20 years ago in software engineering, developing software without source code control and basic configuration management would be an anomaly today. In data science projects, working without them is still not unusual.
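To make the ownership and lifecycle points a little more concrete, here is a minimal sketch of how a team might keep a catalog of its data products. Everything in it is an assumption for illustration: the `DataProduct` and `Stage` names, the fields, and the example entries are hypothetical, not taken from any particular tool or from the book mentioned below.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Lifecycle stages named in the post, in order."""
    DEVELOPMENT = "development"
    INTRODUCTION = "introduction"
    MAINTENANCE = "maintenance"
    RETIREMENT = "retirement"


@dataclass
class DataProduct:
    name: str
    owner: str      # who is responsible for maintaining it (hypothetical field)
    users: str      # the targeted users
    decision: str   # the decision the product is meant to support
    stage: Stage = Stage.DEVELOPMENT

    def advance(self) -> Stage:
        """Move to the next lifecycle stage; retirement is final."""
        stages = list(Stage)  # Enum iteration preserves definition order
        index = stages.index(self.stage)
        if index < len(stages) - 1:
            self.stage = stages[index + 1]
        return self.stage


def unowned(products):
    """Return the products that have no named owner."""
    return [p for p in products if not p.owner.strip()]


# Hypothetical catalog entries, for illustration only.
catalog = [
    DataProduct("weekly churn report", "analytics team", "executives",
                "where to focus retention spend"),
    DataProduct("legacy sales export", "", "unknown", "unknown",
                Stage.MAINTENANCE),
]

# Reports that "appear out of nowhere" show up as entries without an owner.
orphans = [p.name for p in unowned(catalog)]
print(orphans)
```

Even a record this small forces the questions raised above: who the users are, what decision the product supports, who owns it, and where it sits in its lifecycle.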

There’s plenty of opportunity to elaborate on the application of software engineering processes, practices, and tools. While I plan to do that in future posts, today I’ll point you to an excellent book that addresses data products as one aspect of overall organizational data culture. The book is Data Fluency, by Zach Gemignani and Chris Gemignani, the founders of Juice Analytics. We used this book in the UW Data Communications class (DS 735), and it’s full of practical advice. I encourage you to check it out!

Picture details: Oberg Lake (Lake Superior North Shore), 10/3/2018, Canon EOS Digital Rebel, F7.1, 1/200 s, ISO-400, –0.7