Industry executives and experts share their predictions for 2024. Read them in this 16th annual VMblog.com series exclusive.
Data Virtualization will Become a Core Component of Data Lakehouses in 2024
By Dr. Daniel Abadi, Darnell-Kanal Professor of Computer Science, University of Maryland, College Park, and Andy Mott, EMEA head of partner solutions architecture and Data Mesh Lead, Starburst
Data lakehouses have emerged in the past five years as a hybrid middle ground between data lakes and data warehouses. For decades, data warehouses were the primary solution for bringing data from across an organization together into a central, unified location for subsequent data analysis. Data are extracted from source data systems via an "extract, transform, and load" (ETL) process, integrated, and stored inside dedicated data warehouse software such as Oracle Exadata, Teradata, Vertica, or Netezza products, and made available for data scientists
and data analysts to perform detailed analyses. These data warehouses stored data in highly optimized storage formats so that queries could run with high performance and throughput, giving data analysts near-interactive latency even when analyzing very large datasets. However, due to the complexity of the software, these data warehouse solutions were expensive and charged by the size of data stored; storing truly large datasets in them was therefore often prohibitively expensive, especially when it was unclear in advance whether those datasets would provide significant value to data analysis tasks. Furthermore, data warehouses typically required upfront data cleaning and schema declaration, which involved non-trivial human effort - effort that was wasted if the dataset ended up not being used for analysis.
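As a concrete illustration of that ETL pattern, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in - SQLite fills in for real source and warehouse drivers, and the table and column names are invented for illustration, not taken from any particular vendor's product.

```python
# Minimal, illustrative ETL sketch. The source system, warehouse, tables,
# and columns are hypothetical; SQLite stands in for real database drivers.
import sqlite3

source = sqlite3.connect(":memory:")     # stand-in for an operational source system
warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse

source.execute("CREATE TABLE orders (order_id INTEGER, amount_cents INTEGER, country TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 1999, " us "), (2, None, "de"), (3, 450, "US")])
warehouse.execute("CREATE TABLE fact_orders (order_id INTEGER, amount_usd REAL, country TEXT)")

# Extract: read raw rows out of the source system.
rows = source.execute("SELECT order_id, amount_cents, country FROM orders").fetchall()

# Transform: drop incomplete rows and conform values to the warehouse schema.
cleaned = [(oid, cents / 100.0, country.strip().upper())
           for oid, cents, country in rows
           if cents is not None]

# Load: write the integrated rows into the warehouse's fact table.
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
warehouse.commit()
```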
Data lakes therefore emerged as a much cheaper alternative to data warehouses for storing large amounts of data. Because they were typically built on relatively simple free and open source software, the only costs of a home-built data lake were the hardware for the cluster of servers running the data lake software and the labor of the employees overseeing the deployment. Furthermore, data could be dumped into data lakes without upfront cleaning, integration, semantic modeling, or schema generation, making data lakes an attractive place to store datasets whose value for analysis tasks had yet to be determined. Data lakes allowed for a "store-first, organize later" approach.
Nonetheless, over time, some subset of the data in a data lake proves to be highly valuable for data analysis tasks, and the human effort to clean, integrate, and define schemas for it becomes justified. Historically, such data was moved at that point from the data lake to the data warehouse, despite the increased cost of storing it there. More recently, data lakehouses have emerged as an alternative to moving this data to a data warehouse. Instead, the data remains in the data lake, stored using read-optimized open data formats, and a lakehouse (specialized software running on the data lake) manages the schema, metadata, and other administrative functions that have historically been handled by the data warehouse.
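To make this concrete, here is a hedged sketch of declaring a table over files that already sit in the lake, assuming a Trino/Starburst-style engine and its Python DB-API client; the host, catalog, schema, table, and storage path are all hypothetical:

```python
# Sketch: declaring schema and metadata over Parquet files already in the
# lake, assuming a Trino/Starburst-style engine and its Python DB-API
# client. Host, catalog, schema, table, and storage path are hypothetical.
from trino.dbapi import connect

conn = connect(host="lakehouse.example.com", port=8080, user="analyst",
               catalog="hive", schema="analytics")
cur = conn.cursor()

# The data itself is not moved or copied; the lakehouse only records the
# schema and metadata for files that already sit in object storage.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id VARCHAR,
        url     VARCHAR,
        ts      TIMESTAMP
    )
    WITH (
        format = 'PARQUET',
        external_location = 's3://datalake/clean/page_views/'
    )
""")
```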
In addition to managing the schema and metadata of data stored in the data lake, a data lakehouse typically also provides a query interface through which queries over the data it manages can run. These queries run at high performance, in parallel across the servers in the data lake, using scalable query processing techniques similar to those used in high-end data warehouses.
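Continuing the same hypothetical Trino/Starburst-style setup, an analytical query submitted through that interface might look like this sketch:

```python
# Sketch: an analytical query submitted through the lakehouse's query
# interface; the engine parallelizes execution across the data lake's
# servers. Same hypothetical Trino/Starburst-style setup as above.
from trino.dbapi import connect

conn = connect(host="lakehouse.example.com", port=8080, user="analyst",
               catalog="hive", schema="analytics")
cur = conn.cursor()
cur.execute("""
    SELECT url, count(*) AS views
    FROM page_views
    WHERE ts >= TIMESTAMP '2024-01-01 00:00:00'
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
""")
for url, views in cur.fetchall():
    print(url, views)
```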
Unfortunately, today, data lakehouses are often limited to querying the data they manage within the data lake. They implement extremely powerful query engines, yet fundamentally provide a narrow view of data within an organization, since they are only capable of querying data in a given data lake. Yet most organizations keep their most valuable and most heavily used datasets in traditional data warehouses or other high-end database management software. Therefore, queries run by the lakehouse must ignore the most valuable data owned by the organization.
Some of the newer approaches to implementing data lakehouse software include data virtualization technology that solves this problem by allowing lakehouse users to include data stored in external systems within queries over the lakehouse. At a high level, all data within the organization is virtualized - the lakehouse provides a unified interface through which all data within that organization can be queried and joined together, both data in the data lake and data in external systems such as traditional data warehouses. The lakehouse user can query this virtualized data as if it were all physically stored together, and the lakehouse software takes care of all of the complexities of combining data stored in physically different locations and in different types of software.
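As a hedged illustration, a single virtualized query might join lake-resident data with warehouse-resident data through catalog-qualified names, as in this sketch (again assuming a Trino/Starburst-style engine; the catalogs and table names are hypothetical):

```python
# Sketch: one virtualized query joining data in the lake with data in an
# external warehouse via catalog-qualified names. The catalogs ("hive",
# "teradata") and all table names are hypothetical.
from trino.dbapi import connect

conn = connect(host="lakehouse.example.com", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT c.segment, sum(v.views) AS total_views
    FROM hive.analytics.page_view_counts AS v   -- lives in the data lake
    JOIN teradata.dw.customers AS c             -- lives in an external warehouse
      ON v.user_id = c.user_id
    GROUP BY c.segment
""")
print(cur.fetchall())
```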
In 2024, we predict that data virtualization technology will become a core component of data lakehouse solutions. Data lakehouse query processing software has become extremely powerful, providing scalable query performance over petabytes of data stored in a data lake. It is such a waste if it cannot query an organization's most valuable datasets stored outside of the data lake! Data virtualization enables the lakehouse to reach much more of its potential - deploying its high-performance query processing software over a much broader set of data.
We are currently writing a book on data virtualization in the cloud era. In the book, we discuss some of the technical challenges behind data virtualization and how advances in networking hardware and machine learning technology have enabled data virtualization to work for modern applications in areas where it did not work in the past. We specifically focus on data lake use cases, and on the differences between pull-based systems (in which query processing is performed by the data virtualization software itself) and push-based systems (in which query processing is pushed down to the underlying systems that store the data being virtualized) - a distinction illustrated by the toy sketch below. Readers will come away with a better understanding of how data virtualization software works, the practical pitfalls that data virtualization users may run into, and how to tune these systems for better performance. We also discuss how data virtualization fits into the modern data mesh and data fabric paradigms. You can download each chapter, for free, as it is released at: https://www.starburst.io/info/oreilly-data-virtualization-in-the-cloud-era/.
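The pull-vs-push distinction can be illustrated with a toy sketch. Everything here is hypothetical - SQLite stands in for an underlying source system - and real engines choose between the two strategies per operator, based on cost estimates and connector capabilities:

```python
# Toy sketch of pull-based vs. push-based data virtualization. SQLite
# stands in for an underlying source system; real engines choose a
# strategy per operator based on cost estimates and connector features.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, amount REAL, country TEXT)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, 10.0, "US"), (2, 25.0, "DE"), (3, 7.5, "US")])

# Pull-based: fetch raw rows from the source and do the filtering and
# aggregation inside the virtualization engine itself.
rows = db.execute("SELECT amount, country FROM orders").fetchall()
pull_total = sum(amount for amount, country in rows if country == "US")

# Push-based: push the filter and aggregation down into the underlying
# system, so far less data crosses the network.
(push_total,) = db.execute(
    "SELECT sum(amount) FROM orders WHERE country = 'US'").fetchone()

assert pull_total == push_total  # same answer, very different data movement
```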
##
ABOUT THE AUTHORS
Dr. Daniel Abadi, Darnell-Kanal Professor of Computer Science, University of Maryland, College Park
Professor Daniel Abadi performs research on database system architecture and implementation, especially at the intersection with scalable and distributed systems. He is best known for the development of the storage and query execution engines of the C-Store (column-oriented database) prototype, which was commercialized by Vertica (acquired by Hewlett-Packard in 2011); for his HadoopDB research on fault-tolerant scalable analytical database systems, which was commercialized by Hadapt (acquired by Teradata in 2014); and for his work on deterministic, scalable, transactional, distributed systems such as Calvin. Abadi is an ACM Fellow and has been a recipient of a Churchill Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, a VLDB Best Paper Award, two VLDB Test of Time Awards (for the work on C-Store and HadoopDB), the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, the 2013-2014 Yale Provost's Teaching Prize, and the 2013 VLDB Early Career Researcher Award. He advised the PhD dissertations of Alexander Thomson and Jose Faleiro, both of which won SIGMOD Jim Gray Doctoral Dissertation Awards (in 2015 and 2020, respectively).
Andrew Mott, EMEA head of partner solutions architecture and Data Mesh Lead, Starburst
With more than 20 years of experience in data analytics, Andrew Mott is skilled at optimizing the utility of analytics within organizations. When determining how to generate value or fortify existing revenue through technology, Andrew considers the alignment of an organization's culture, structure, and business processes, ensuring that the organization's strategic direction will ultimately enable it to outcompete in its market. Andrew Mott is currently EMEA head of partner solutions architecture and Data Mesh Lead at Starburst, and lives in the United Kingdom.