A data lakehouse is a new term showing up on the analytical tools scene. It aims to combine the best parts of a data warehouse and a data lake. Is it tool vendor hype or the next great thing in business intelligence? Given the name of this site is Lake Data Insights and I physically live in a lake house, it’s an important topic for this blog!
Warehouse
Before we jump into the lakehouse, let’s take a step back to provide an overview of warehouses and lakes.
The data warehouse is the senior member of this trio as goes back to the early 90’s when Bill Inmon and Ralph Kimball were developing their leading edge ideas for the data warehouse. Its goal is make business information readily available to facilitate better decision making. Operational systems are siloed and built for transactions. A warehouse brings together data from many systems and is built with a data schema optimized for slicing and dicing the business data in interesting ways. When done well, the warehouse will have excellent query performance and be able to handle significant load from reporting systems and ad hoc needs.
“When done well” is the key phrase in the last sentence. Businesses require agility for decision making, but data warehouses are far from agile in most situations. They depend on what’s known as the Extract-Transform-Load (ETL) process to bring data into the warehouse. This ETL process can be costly to build and maintain. Adding new sources is non-trivial. But you’re going to want to be able to quickly add new sources as business needs change. Looking at the data in new ways may require a schema change, and that can be another source of cost and time. Finding the right balance of cost, stability, and agility is a challenge for warehouses.
Lake
On the other hand, a data lake is built to bring all your data together into a single place. Over the last 10 years, the data lake has proven to be a cost effective way to store data. The challenge with the data lake is accessing and governing the data. Spark based technology is frequently used with data lakes. That is much less common toolset than the T-SQL language frequently used to access a data warehouse and support the ETL process. The data needs to be properly secured and cataloged. If it isn’t, it can become what is known as a data swamp. And nobody wants to be near a swamp!
Along the way to the lakehouse is the concept of the ‘modern data warehouse’ that is a two tier approach using both a data lake and a warehouse. This is a capable duo, but can be complex given the technologies involved.
Lakehouse and Azure Synapse
Finally, we get to the lakehouse. Like most fledgling technologies, there are multiple definitions and architectures floating around (see the links below for details). But all of the approaches aim to combine the good parts of warehouses and lakes under a single roof. While there are some skeptics out there, most of the content that I’ve seen support the need for the new lakehouse concept / architecture. Given the challenges of lakes and warehouses along with the promise of the lakehouse, I concur.
Azure Synapse is a product that I’ve been using some lately and it’s clear that it overlaps with the lakehouse concept (see Data Lakehouse & Synapse | James Serra’s Blog for a deeper dive). For example, the Synapse ability to apply serverless SQL to data in the lake means that I can use the same T-SQL statements to do the following:
- Execute a SQL view that queries data from a normalized set of CSV files sitting in the lake. This is data that has been exported from Dynamics 365 Finance in its ‘raw’ form.
- Query the materialized form of the data from the previous bullet, with the data stored in a parquet file in the data lake.
- Query the materialized form of this same data stored in a dedicated SQL data warehouse.
As the example above describes, the tooling to access the lake and the warehouse have become blurred. If you need performance, you can build an ETL process to bring data into a warehouse. If you need access to additional data that your business suddenly needs, you can get to that in the lake. And it’s all under one roof in Synapse Studio.
Serverless SQL and the uniform use of T-SQL are important benefits of Synapse. This is one of the key values of the lakehouse concept and I look forward to seeing how this evolves in the coming months.
References and more information:
Data Lakehouse & Synapse | James Serra’s Blog
Data Lakehouse defined | James Serra’s Blog
The Data Lakehouse – Dismantling the Hype — Advancing Analytics
What is a Lakehouse? – The Databricks Blog
Eckerson Group Webinar -Shop Talk – the Data Lakehouse – April 17, 2020
Picture details: 10/9/2020, Canon PowerShot G3 X, f4.5, 1/1250, ISO-125