Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
Hi all!
I’ve been wanting to continue writing about papers that I think are interesting and relevant to infrastructure in general, and I want to start off with this paper from Databricks that has been creating new buzz / marketing around what data management of the future looks like. It’s even more fun given the ongoing war between Snowflake and Databricks, fighting back and forth over who’s faster and cheaper, each looking to capitalize on the Lakehouse shift.
In this post I’ll walk through the major points of the paper and share some of my own observations about this space as well.
The world of lakes + warehouses
Let’s talk about what the paper describes as the generations of data analytics platforms.
The first generation is what we knew as the run-your-own-data-warehouse-software era: enterprises bought appliances that coupled storage and compute, which led to hard-to-manage costs and an inability to handle unstructured and semi-structured data.
The second generation is the Hadoop era, which as we all know was inspired by Google’s GFS and MapReduce papers and led to the great Hortonworks vs Cloudera times. This is what became the “Data Lake”: compute and storage are decoupled, data is mostly stored in HDFS (and now cloud object storage), and ETL pipelines write data from that storage into specific BI solutions and warehouses to serve specific needs. This also led to the spawn of (and big debate around) the Lambda architecture, and the rise of Kafka.
So, what’s wrong with the current generation then?
Here are the main problems the paper calls out with the current architecture:
Reliability → The ETL layer that needs to run to keep the warehouse in sync with the Data Lake introduces more potential quality issues, as well as engineering cost to maintain.
Data Staleness → Another effect of requiring ETL is that the data in the warehouse is essentially out of sync with the source of truth stored in your Data Lake.
Limited support for advanced analytics → With the rise of ML, data science teams increasingly want to run arbitrary frameworks on top of their data to answer more complex questions or power user-facing features. However, data in the warehouse after ETL is usually locked into a proprietary format, and data scientists don’t have direct access to the source, which requires yet another ETL step from the warehouse into files that can be loaded into frameworks like PyTorch.
Total cost of ownership → Besides the extra engineering and ETL operation costs, the amount of storage required also increases given the duplicated data.
In summary, having to copy data with an ETL system creates operational burden along with additional data staleness and cost challenges.
Data lake + warehouse = Lakehouse
So is there a way to avoid the extra ETL while still serving all the warehousing + ML + reporting needs?
This is where the Lakehouse architecture comes in, which the paper defines as a data management system built directly on top of low-cost and directly-accessible storage, combining the benefits of a Data Lake and a Data Warehouse.
The first problem to solve, without falling back to a “closed-world” data warehouse, is enabling various warehouse and ingestion systems to write to the Data Lake safely. To solve this, the first layer at the bottom of the stack is a shared metadata layer that sits below all the writers and keeps additional metadata next to the raw data stored in the Data Lake. This metadata layer provides a consistent view of the table state, allowing multiple arbitrary frameworks to write concurrently with transactional guarantees (aka ACID), and also enables governance and auditing, etc. There are several implementations of this layer, each with its own metadata file format, such as Apache Iceberg (commercially Tabular), which was born out of Netflix, and Delta Lake, built by Databricks.
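To make the transaction-log idea concrete, here’s a minimal sketch using the open-source `deltalake` (delta-rs) Python package; the table path and column names are made up for illustration, and in practice the path would point at cloud object storage.

```python
# Minimal sketch of a transactional write to a Delta table, using the
# open-source `deltalake` (delta-rs) package. Path and columns are hypothetical;
# a real deployment would use an object-store URI instead of a local path.
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

table_path = "/tmp/lakehouse/events"  # hypothetical table location

batch = pa.table({
    "event_id": [1, 2, 3],
    "payload": ["a", "b", "c"],
})

# Each write is committed as a new entry in the _delta_log directory, which is
# what gives concurrent writers ACID semantics on top of plain storage.
write_deltalake(table_path, batch, mode="append")

dt = DeltaTable(table_path)
print(dt.version())   # latest committed version of the table
print(dt.files())     # parquet files that make up the current snapshot
print(dt.history())   # audit trail of commits from the transaction log
```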
The second problem to solve is SQL performance. Data warehouse systems today like Redshift, Snowflake and others require ETL to ingest data because they transform the data into an internal format optimized for their own execution engines, and build internal indexes for faster random access and filtering. Without an ETL layer, we essentially need to decouple this process: keep the data in an open standard format, and provide format-independent optimizations and indexes that other query engines can also understand and leverage.
Here’s the list of optimizations mentioned in the paper:
Caching:
One optimization data warehouses typically apply when serving queries from cloud object storage is to cache data on local SSDs or in memory. Building a caching layer around a data format that can be natively understood by open-ecosystem warehouses can potentially improve the performance of the processing nodes, and also benefit multiple frameworks. The paper specifically cites Delta Engine’s example of storing its cached data as partially decompressed Parquet, so it can be loaded faster without extra work.
There is an opportunity here to also see whether a general in-memory format like Apache Arrow can become a universal caching format for multiple query engines. Similar projects have existed in the past, such as Alluxio, which focuses more on supporting heterogeneous storage systems but also provides a caching layer to speed up performance.
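As a toy illustration of the idea (not how Delta Engine’s cache actually works), here’s a sketch that decodes Parquet files into Arrow tables once and keeps them in memory so repeated scans skip the decompression and decoding work; the file path is hypothetical.

```python
# Toy sketch of "cache decoded data in an open in-memory format": parquet files
# are decoded once into Arrow tables and held in memory for repeated reads.
import pyarrow.parquet as pq

_cache = {}  # file path -> decoded pyarrow.Table

def read_cached(path: str):
    if path not in _cache:
        _cache[path] = pq.read_table(path)  # decode parquet -> Arrow columnar
    return _cache[path]

table = read_cached("/tmp/lakehouse/events/part-0.parquet")  # hypothetical file
print(table.num_rows)
```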
Auxiliary Data:
Auxiliary data such as statistics (e.g. column1 has min value X and max value Y), stored in open data formats like Parquet or in metadata files like the Iceberg or Delta Lake transaction log, can be leveraged to help query optimizers skip over data that isn’t necessary to load and process.
Additionally, indexes stored in object storage can be created to facilitate more efficient data skipping, such as the Bloom filter indexes in Delta Engine.
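To make the statistics part concrete, here’s a small sketch that reads the min/max statistics out of a Parquet footer with PyArrow; the file path and column are hypothetical, and a real engine would use these stats to decide which row groups or files it can skip entirely.

```python
# Sketch of min/max statistics in parquet footers enabling data skipping:
# a reader checks each row group's stats and avoids scanning groups whose
# value range cannot possibly match the filter.
import pyarrow.parquet as pq

pf = pq.ParquetFile("/tmp/lakehouse/events/part-0.parquet")  # hypothetical file

for i in range(pf.metadata.num_row_groups):
    col_meta = pf.metadata.row_group(i).column(0)  # stats for the first column
    stats = col_meta.statistics
    if stats is not None and stats.has_min_max:
        # e.g. if the query filters on event_id > 100 and stats.max < 100,
        # this whole row group can be skipped without reading any data pages
        print(i, stats.min, stats.max)
```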
Data layout:
Another optimization is deciding how to group the rows stored in object storage, since each Parquet file can contain many rows. To improve performance, the query engine ideally wants to skip as many files as possible without needing to read them from storage. Delta Engine implements z-ordering on specific columns to reduce value overlap between files and distribute the grouping more evenly, so the query engine can skip files more effectively. Future work could include determining compression strategies based on the data stored / accessed at runtime, etc.
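Here’s a conceptual sketch of what z-ordering does: interleave the bits of two columns into a Morton code and sort rows by it, so rows that are close in either dimension land in the same files and per-file min/max ranges stay narrow. This is just an illustration of the idea, not Delta Engine’s actual implementation.

```python
# Conceptual z-order (Morton) sketch: interleave bits of two integer columns
# so sorting by the combined key clusters rows along both dimensions, which
# makes file skipping effective for filters on either column.
def z_value(x: int, y: int, bits: int = 16) -> int:
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # even bit positions from x
        z |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions from y
    return z

rows = [(3, 7), (100, 2), (5, 5), (98, 1)]   # hypothetical (col_a, col_b) pairs
rows.sort(key=lambda r: z_value(r[0], r[1]))
print(rows)  # nearby (col_a, col_b) values end up adjacent, i.e. in the same files
```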
The last problem is providing efficient access for other analytics frameworks like ML.
Frameworks like scikit-learn or TensorFlow don’t run SQL, yet they can potentially access large amounts of data, especially when used with scalable data science frameworks like Dask.
The approach the paper describes is to expose a DataFrame API / SDK that translates DataFrame operations into Spark SQL execution, so it leverages all the optimizations mentioned before, like caching and data skipping. Koalas from Databricks, Modin, and Dask are all examples of wrappers around existing DataFrame APIs that implement a different backend.
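As a rough sketch of what this looks like, here’s pandas-on-Spark (Koalas’ successor, shipped as `pyspark.pandas`), where pandas-style operations build lazy Spark SQL plans and therefore pick up the same optimizations as SQL queries. It assumes a configured SparkSession, and the table path and column names are hypothetical.

```python
# Sketch of the DataFrame-API-over-Spark-SQL approach: pandas-style calls are
# translated into Spark SQL plans instead of running eagerly in local memory.
import pyspark.pandas as ps

pdf = ps.read_parquet("/tmp/lakehouse/events")   # lazily builds a Spark plan

result = (
    pdf[pdf["event_id"] > 100]                   # filter pushed into Spark SQL
       .groupby("payload")
       .count()
)
print(result.head())                             # triggers distributed execution
```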
The biggest downsides of this approach are that 1) ML users need to adopt the library, 2) it only supports dataframes, which limits support for projects like PyTorch or TensorFlow, and 3) each implementation of the DataFrame API is engine- and framework-specific.
There is a lot more room to grow and explore on this side, as ML / AI isn’t only bottlenecked by data loading and execution, but also by data transfer and general “AI tax” work. Optimizations like pushing ML logic into SQL query optimization (aka factorized ML) also continue to chip away at redundant work on the ML side.
Closing thoughts
As mentioned in the paper, the Lakehouse architecture or direction will have different implementations, and there are still many open questions and opportunities to continue to innovate.
It’s also an open question how enterprises will adopt a Lakehouse architecture that uses frameworks beyond a single vendor, as a lot of today’s trend is moving towards an open data ecosystem, which starts from the file formats but is also moving up into the compute stack and frameworks.
There are also open source projects like Apache Arrow, not mentioned in the paper, that could provide another level of standardization for cross-tool optimization and access.
We’re also seeing standards being created at the dataframe and SQL execution levels that further facilitate newer cross-stack optimization opportunities.
It’s an interesting time in the data infrastructure space to watch how the ecosystem evolves.