From Data Warehouse to Data Lake to Data Lakehouse

First came the traditional enterprise data warehouse (EDW). Structured data is integrated into an EDW from external data sources using ETLs (check out my recent blog post on this). The data can then be queried by end-users for BI and reporting, which is what EDWs were purpose-built for. But with the growing desire to incorporate more data, of different types, from different sources, with different change rates, the traditional EDW has fallen short. It does not support unstructured data (e.g., video, audio, and free-form text), streaming is for the most part out of the question, data science and machine learning cannot be done directly on the data, and because of the closed, proprietary nature of EDWs, costs quickly skyrocket as organizations scale their deployments. Modern, cloud-based EDWs have looked to address several of these challenges and have done a good job of it, but some challenges remain, the most obvious being the lack of support for unstructured data.

The data lake came along with a promise of handling all data, the more, the better. Place it all in one location, in an open format, for when you’re ready to use the data. And when you’re ready for data science and machine learning, the data and tool integrations are ready and available. But a new set of challenges presented itself. Setup and maintenance were hard, operational costs skyrocketed, support for BI was limited, and ensuring trust in high-quality data was difficult. On top of that, many viewed data lakes as a way to consolidate data silos, but what happened instead was that a new silo was created, with the same data residing in both a data lake and an EDW. The aspirational data lake was turning into a data swamp. Today, data lakes are seeing a resurgence as businesses look to incorporate as much data as possible into their decision-making. And a wave of both mature and new vendors is looking to deliver on the initial promise of a data lake by re-architecting platforms, offering managed services, incorporating data virtualization, and more tightly integrating BI, data visualization, data science, and machine learning platforms.

Shouldn’t organizations have the best of both worlds? Enter the data lakehouse. Businesses should get the BI and reporting capabilities they’re used to from an EDW and they should be able to easily and cost-effectively bring all data to one place in an open format that helps avoid vendor lock-in and sets them on a path to easily incorporate advanced data science, AI, and ML. Sounds too good to be true. Well it’s not. Or at least it won’t be. Not if Databricks has a say.

At this week's Spark and AI Summit, Databricks doubled down on their vision of ubiquitous data access and analysis to fuel data science. Over the years, Databricks fostered the adoption of Spark. They’ve incorporated data science and AI into their platform. They announced Delta Lake at last year’s Spark and AI Summit, a highly reliable and high-quality data lake using an open transactional layer on top of existing data with core data warehouse principles, like schema enforcement and ACID compliance. And yesterday they announced Delta Engine. OK, so I guess in hindsight, they’ve quadrupled down on their vision.
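To make "schema enforcement" and the all-or-nothing side of ACID a little more concrete, here is a toy sketch in plain Python. It is not Delta Lake's API or implementation; `ToyTable`, `SchemaError`, and the column names are hypothetical, made up purely for illustration. The idea it shows is simply that a write is validated against a declared schema, and a batch containing a bad row is rejected as a whole rather than partially applied.

```python
class SchemaError(ValueError):
    """Raised when a batch does not match the table schema."""


class ToyTable:
    """Hypothetical in-memory table; illustrates the idea only."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self.rows = []              # rows that have been committed

    def append(self, batch):
        # Schema enforcement: validate every row before touching the table.
        for row in batch:
            if set(row) != set(self.schema):
                raise SchemaError(f"unexpected columns: {sorted(row)}")
            for name, expected in self.schema.items():
                if not isinstance(row[name], expected):
                    raise SchemaError(f"{name} must be {expected.__name__}")
        # Commit only after the whole batch passed, so a bad batch
        # leaves the table unchanged (all-or-nothing).
        self.rows.extend(dict(row) for row in batch)


table = ToyTable({"id": int, "amount": float})
table.append([{"id": 1, "amount": 9.5}])

try:
    # The second row has a string id, so the entire batch is rejected.
    table.append([{"id": 2, "amount": 3.0}, {"id": "three", "amount": 1.0}])
except SchemaError:
    pass
```

After the failed write, the table still holds only the first committed row. A data lake without this kind of check would happily accept the malformed record, which is one way a lake drifts toward a swamp.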

Delta Engine is a high-performance query engine that looks to address the common performance challenges experienced with data lakes, especially when running interactive queries. Its focus is on enhancing the performance of SQL and data frame workloads. It consists of three main components:

  1. A native, vectorized execution engine, called Photon. It is a rewritten MPP query engine built to take advantage of modern hardware, including the ability to execute a single instruction across multiple data elements (SIMD). The goal is to improve performance for all workload types while remaining fully compatible with open Spark APIs.
  2. An improved query optimizer that extends Spark's cost-based optimizer and Spark's adaptive query execution optimizer with more advanced statistics. Databricks claims up to 18x faster performance for star schema workloads.
  3. An intelligent caching layer that sits between the execution engine and the data lake's core object storage and automatically chooses which data to cache for the end-user. It transcodes data into a more CPU-efficient format that enables faster decoding for future query processing. Databricks claims up to 5x scan performance improvements.
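The caching idea in the third component can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how Delta Engine is built: `TranscodingCache`, the `sales/amount` path, and the comma-separated text format are all hypothetical, and the sketch caches everything it reads rather than choosing what to cache. What it does show is the transcode-once pattern: the first read pays the cost of decoding the stored format, and later reads get a compact, typed array directly.

```python
from array import array


class TranscodingCache:
    """Hypothetical read-through cache; illustrates the idea only."""

    def __init__(self, object_store):
        self.object_store = object_store  # path -> raw text (stands in for object storage)
        self.cache = {}                   # path -> array of doubles
        self.decodes = 0                  # number of times the parsing cost was paid

    def read_column(self, path):
        if path not in self.cache:
            raw = self.object_store[path]
            # Transcode once: parse the text into a compact typed array.
            self.cache[path] = array("d", (float(v) for v in raw.split(",")))
            self.decodes += 1
        # Cache hits skip parsing and return the typed array directly.
        return self.cache[path]


store = {"sales/amount": "10.5,20.0,4.5"}
cache = TranscodingCache(store)

first = sum(cache.read_column("sales/amount"))   # decodes, then caches
second = sum(cache.read_column("sales/amount"))  # served from the cache
```

Both reads return the same total, but `cache.decodes` stays at 1, because the second read never touches the raw text.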

Now the big question is whether the market is ready for a solution like this. We track how mature organizations are in leveraging data-centric technologies to enable better, faster access to data, and with the introduction of Delta Engine, Databricks can now offer a complete technology stack to address the data lakehouse paradigm. But let it be known that the wounds of the initial data lake adopters are still fresh. Hesitancy remains about dedicating time to, and investing heavily in, what is yet another approach to the ultimate data experience nirvana. And while a unified analytics platform is appealing to many businesses, every vendor that offers an approach to unified analytics has gaps, whether in multi-cloud/hybrid cloud support, workload support, persona support, technology integrations, open source support, or elsewhere.

There will be some who look at this announcement as too good to be true. And for some, it may turn out that way. But I would not bet against Databricks. A growing percentage of their customers (I think it’s around 45% today, but don’t quote me on that) are on Delta Lake. The natural next step for these customers is to leverage Delta Engine. And that's before considering that I’m seeing more and more vendors talk about the data lakehouse paradigm.
