Tools of the trade: Resolving missing data in a time series

[Photo: Starship cloud]

This is the seventh post in a series of blog posts using a theme of “Tools of the trade”. The series covers software tools, statistical concepts, data science techniques, and related items. In all cases, the topic will contribute to accomplishing data science tasks. The target audience for the posts is engineers, analysts, and managers who want to build their knowledge and skills in data science, particularly those in the Microsoft Dynamics ecosystem.

This post continues discussions of time series data. The focus of this post is how to address data that is missing in the time series. Past posts related to time series are shown below:

What is it?

Missing data in a time series is self-explanatory. It’s simply data that is not present in the dataset for one reason or another.

The ‘one reason or another’ part of the previous statement is where things get interesting. There are three categories of missing data… with names that I find more than a little bit confusing. Here’s my simplistic interpretation (see the references section for more complete explanations):

  • Missing at random (MAR) – I can explain the missing value by reviewing other observations.
  • Missing completely at random (MCAR) – I have no idea why this value is missing.
  • Missing not at random (MNAR) – I know exactly why this data point is missing.
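
Whichever category applies, the first task is surfacing the gap. Here is a minimal pandas sketch using a made-up monthly series; the dates and values are hypothetical, chosen only to illustrate the pattern:

```python
import pandas as pd

# Hypothetical monthly readings with August 2019 absent entirely
# (the values are made up for illustration).
usage = pd.Series(
    [412.0, 398.0, 455.0],
    index=pd.to_datetime(["2019-06-01", "2019-07-01", "2019-09-01"]),
)

# Reindex against a complete month-start range; any month absent
# from the source data surfaces as an explicit NaN.
full_range = pd.date_range(usage.index.min(), usage.index.max(), freq="MS")
usage = usage.reindex(full_range)

print(usage[usage.isna()])  # 2019-08-01   NaN
```

The reindex step matters because a missing month in source data often isn’t stored as a NaN; the row simply doesn’t exist until you force the index to be complete.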

The category of the missing data leads to guidance about what to do with the data point. Some options include:

  • Throw out the data point.
  • Leave the data point in and replace it with a known value. For some of our Finance implementations, that known value is 0 because the missing data point is MNAR… and we know it should be 0.
  • Impute the missing value – there are a number of ways to replace the missing value with another value. The approaches vary from trivial (use the last value) to very sophisticated (develop an ML model to impute values). Domain insight is important when determining the best imputation approach. A short sketch of all three options follows this list.
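
To make those options concrete, here is a minimal pandas sketch; the series and its values are hypothetical:

```python
import pandas as pd

# Hypothetical monthly series with a gap at August 2019.
s = pd.Series(
    [398.0, None, 455.0],
    index=pd.to_datetime(["2019-07-01", "2019-08-01", "2019-09-01"]),
)

# Option 1: throw the data point out entirely.
dropped = s.dropna()

# Option 2: keep the point but substitute a known value, e.g. 0 for
# an MNAR gap where we know the true value is zero.
known = s.fillna(0)

# Option 3: impute. Trivial: carry the last value forward (LOCF).
locf = s.ffill()
# Slightly fancier: linear interpolation between neighboring months.
interpolated = s.interpolate(method="linear")
```

Note that ffill and interpolate are the pandas spellings of LOCF and linear interpolation, both of which come up again in the discussion below.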

How do I use it?

Missing values are part of the data messiness that is typical of most data science projects. Working with ERP data helps since it is structured. Working with financial data in an ERP system is even better since that data is often required. Unfortunately, challenges remain even in ERP scenarios, and we’re usually sourcing data from other places as well.

My approach is twofold:

  • Avoid throwing out data.
  • Pick the simplest imputation approach that is reasonable.

Discussion

In my last post, I noted that there was a missing data point in my electrical utility time series data. The gap is in August 2019, and I know exactly why it is missing: we changed budgeting programs around that time and must have had a gap in recording information. So this is clearly an MNAR situation.


There are multiple options that are reasonable for this situation:

  1. Use the July 2019 value (Last Observation Carried Forward, or LOCF)
  2. Use the September 2019 value (Next Observation Carried Backward, or NOCB)
  3. Use the average of July and September data (linear interpolation)
  4. Use the average of August data from the previous years

The Minnesota weather pattern is such that August is typically more like July than September, so I would be inclined to go with either #1 or #4. Using the July 2019 value better accommodates any trend, whereas using August data from previous years better captures seasonality. Since seasonality is the dominant characteristic of this dataset, I think #4 is the more general solution. A sketch of both candidates follows.
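
Here is a minimal pandas sketch of the two candidates; the values are randomly generated stand-ins rather than my actual utility data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the utility data: monthly values from
# Aug 2016 through Dec 2019, with the Aug 2019 reading missing.
idx = pd.date_range("2016-08-01", "2019-12-01", freq="MS")
rng = np.random.default_rng(0)
kwh = pd.Series(300 + 150 * rng.random(len(idx)), index=idx)
gap = pd.Timestamp("2019-08-01")
kwh.loc[gap] = np.nan

# Option 1: LOCF, carrying the July 2019 value forward.
locf = kwh.ffill()

# Option 4: fill the gap with the mean of the prior Augusts.
prior_augusts = kwh[(kwh.index.month == 8) & (kwh.index.year < 2019)]
filled = kwh.copy()
filled.loc[gap] = prior_augusts.mean()
```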

Using the average of the August 2016-2018 values for August 2019 yields the following plot.

[Plot: the monthly utility series from Aug ’16 through Dec ’19, y-axis roughly 200 to 500, with the imputed August 2019 value in place]

Now that we have our data ready to go, we can get into forecasting in the next post.

References

https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4

https://en.wikipedia.org/wiki/Missing_data

Picture details:  Starship cloud, 6/1/20, iPhone 7, f1.8, 1/2398, ISO-20