Starburst 2024 Predictions: Data Virtualization will Become a Core Component of Data Lakehouses in 2024


Industry executives and experts share their predictions for 2024.  Read them in this 16th annual VMblog.com series exclusive.

Data Virtualization will Become a Core Component of Data Lakehouses in 2024

By Dr. Daniel Abadi, Darnell-Kanal Professor of Computer Science, University of Maryland, College Park, and Andy Mott, EMEA head of partner solutions architecture and Data Mesh Lead, Starburst

Data lakehouses have emerged in the past five years as a hybrid middle ground between data lakes and data warehouses. For decades, data warehouses were the primary solution for bringing data from across an organization together into a central, unified location for subsequent analysis. Data were extracted from source systems via an "extract, transform, and load" (ETL) process, integrated, stored inside dedicated data warehouse software such as Oracle Exadata, Teradata, Vertica, or Netezza, and made available for data scientists and data analysts to perform detailed analyses. These data warehouses stored data in highly optimized storage formats so that queries could run at high performance and throughput, giving analysts near-interactive latency even when analyzing very large datasets. However, due to the complexity of the software, these data warehouse solutions were expensive and typically priced by the amount of data stored; storing truly large datasets in them was therefore often prohibitively expensive, especially when it was unclear in advance whether those datasets would provide significant value to data analysis tasks. Furthermore, data warehouses typically required upfront data cleaning and schema declaration, non-trivial human effort that is wasted if the dataset ends up never being used for analysis.
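
To make the ETL pattern concrete, here is a minimal Python sketch of extract, transform, and load, using sqlite3 as a stand-in for dedicated warehouse software and a hypothetical orders_export.csv file as the source-system extract:

```python
# Minimal ETL sketch: sqlite3 stands in for dedicated warehouse software,
# and "orders_export.csv" is a hypothetical source-system export.
import csv
import sqlite3

def extract(path):
    """Read raw rows from the source system's CSV export."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Clean and normalize raw rows before loading."""
    for row in rows:
        yield (
            int(row["order_id"]),
            row["customer"].strip().lower(),
            round(float(row["amount_usd"]), 2),
        )

def load(rows, db_path="warehouse.db"):
    """Write transformed rows into a warehouse table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount_usd REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("orders_export.csv")))
```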

Data lakes therefore emerged as a much cheaper alternative to storing large amounts of data in data warehouses. Because they were typically built from relatively simple free and open source software, the only costs of storing data in a home-built data lake were the hardware for the cluster of servers running the data lake software and the labor of the employees overseeing the deployment. Furthermore, data could be dumped into data lakes without upfront cleaning, integration, semantic modeling, or schema generation, making data lakes an attractive place to store datasets whose value for analysis had yet to be determined. Data lakes allowed for a "store-first, organize later" approach.

Nonetheless, over time some subset of the data in a data lake proves to be highly valuable for data analysis tasks, and the human effort to clean, integrate, and define schemas for it becomes justified. Historically, such data was moved at this point from the data lake to the data warehouse, despite the higher cost of storing it there. More recently, data lakehouses have emerged as an alternative to moving this data into a data warehouse: the data remains in the data lake, stored using read-optimized open data formats, and the lakehouse (specialized software running on the data lake) manages the schema, metadata, and other administrative functions that have historically been handled by the data warehouse.
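
As a rough illustration of what "read-optimized open data formats" means in practice, the following sketch (assuming the pyarrow package; file and column names are hypothetical) writes a small dataset as Parquet and then reads back its schema and file metadata without scanning the data, which is the kind of information a lakehouse catalog manages on top of the lake:

```python
# Sketch: land data in the lake in an open, read-optimized format (Parquet),
# then inspect its schema and metadata from the file footer alone -- roughly
# the information a lakehouse catalog tracks without rewriting the data.
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to the lake as Parquet (hypothetical file name).
events = pa.table({
    "event_id": [1, 2, 3],
    "user": ["alice", "bob", "alice"],
    "latency_ms": [12.5, 48.0, 7.3],
})
pq.write_table(events, "events.parquet")

# Later: read only the schema and row-group metadata, no full scan required.
print(pq.read_schema("events.parquet"))
print(pq.ParquetFile("events.parquet").metadata)
```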

In addition to managing the schema and metadata of data stored in the data lake, a data lakehouse typically also provides a query interface for the data it manages. Queries run at high performance, in parallel across the servers in the data lake, using scalable query processing techniques similar to those used in high-end data warehouses.
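
As a toy sketch of that scale-out pattern (a deliberate simplification, not any particular engine's implementation), the snippet below has each worker aggregate its own partition in parallel and a coordinator merge the partial results, which is roughly how a distributed GROUP BY-style aggregation runs across servers:

```python
# Toy partial-aggregation sketch: workers aggregate their own partitions in
# parallel, then a coordinator merges the partial results (a very simplified
# picture of how scale-out engines parallelize an aggregation).
from collections import Counter
from multiprocessing import Pool

def scan_and_aggregate(partition):
    """Worker step: count rows per key within one partition."""
    counts = Counter()
    for key, _value in partition:
        counts[key] += 1
    return counts

def merge(partials):
    """Coordinator step: combine the per-partition partial aggregates."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    # Hypothetical partitions of (key, value) rows spread across workers.
    partitions = [
        [("us", 1), ("eu", 2), ("us", 3)],
        [("eu", 4), ("eu", 5)],
        [("us", 6), ("apac", 7)],
    ]
    with Pool(processes=3) as pool:
        partials = pool.map(scan_and_aggregate, partitions)
    print(merge(partials))  # Counter({'us': 3, 'eu': 3, 'apac': 1})
```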

Unfortunately, today data lakehouses are often limited to querying the data they manage within the data lake. They implement extremely powerful query engines, yet fundamentally provide a narrow view of an organization's data, since they are only capable of querying data in a given data lake. Most organizations, however, keep their most valuable and most heavily used datasets in traditional data warehouses or other high-end database management software. Queries run by the lakehouse must therefore ignore the most valuable data the organization owns.

Some newer approaches to implementing data lakehouse software include data virtualization technology that solves this problem by allowing lakehouse users to include data stored in external systems in queries over the lakehouse. At a high level, all data within the organization is virtualized: the lakehouse provides a unified interface through which all of that data, both in the data lake and in external systems such as traditional data warehouses, can be queried and joined together. The lakehouse user can query this virtualized data as if it were all physically stored together, and the lakehouse software handles the complexities of combining data stored in physically different locations and in different types of software.
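
As a highly simplified sketch of this virtualized querying, the snippet below pulls rows from two physically separate stand-in systems, a Parquet file representing the data lake and a sqlite3 database representing an external warehouse, and joins them in one place so the caller sees a single unified result (assumes pyarrow; file, table, and column names are hypothetical):

```python
# Sketch of a virtualized join: pull rows from two physically separate
# stand-in systems -- a Parquet file representing the data lake and a sqlite3
# database representing an external warehouse -- and join them in one place.
import sqlite3
import pyarrow.parquet as pq

def lake_rows():
    """Scan clickstream events stored as Parquet in the lake."""
    table = pq.read_table("events.parquet", columns=["user", "latency_ms"])
    return table.to_pylist()  # [{'user': ..., 'latency_ms': ...}, ...]

def warehouse_regions():
    """Fetch the customer dimension from the external warehouse."""
    conn = sqlite3.connect("warehouse.db")
    rows = conn.execute("SELECT customer, region FROM customers").fetchall()
    conn.close()
    return {customer: region for customer, region in rows}

def virtualized_join():
    """Join lake events with warehouse customers as if they were co-located."""
    region_of = warehouse_regions()
    return [
        {**event, "region": region_of.get(event["user"], "unknown")}
        for event in lake_rows()
    ]

if __name__ == "__main__":
    print(virtualized_join())
```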

In 2024, we predict that data virtualization technology will become a core component of data lakehouse solutions. Data lakehouse query processing software has become extremely powerful, providing scalable query performance over petabytes of data stored in a data lake. It is a waste if that software cannot query an organization's most valuable datasets stored outside of the data lake! Data virtualization enables the lakehouse to reach much more of its potential by applying its high-performance query processing to a much broader set of data.

We are currently writing a book on data virtualization in the cloud era. In the book, we discuss some of the technical challenges behind data virtualization and how advances in networking hardware and machine learning technology have enabled data virtualization to work for modern applications in scenarios where it did not work in the past. We specifically focus on data lake use cases and the differences between pull-based systems (in which query processing is performed by the data virtualization software itself) and push-based systems (in which query processing is pushed down to the underlying systems that store the data being virtualized). Readers will come away with a better understanding of how data virtualization software works, the practical pitfalls users may run into, and how to tune these systems for better performance. We also discuss how data virtualization fits into the modern data mesh and data fabric paradigms. You can download each chapter, for free, as it gets released at: https://www.starburst.io/info/oreilly-data-virtualization-in-the-cloud-era/.
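
As a rough illustration of that pull-based vs. push-based distinction, the sketch below runs the same aggregation against an external SQL source in both styles (sqlite3 as a stand-in; the orders table is hypothetical): the pull-based version fetches raw rows and aggregates them in the virtualization engine, while the push-based version pushes the aggregation down so only the small result leaves the source:

```python
# Sketch: pull-based vs. push-based handling of the same aggregation against
# an external SQL source (sqlite3 as a stand-in; the "orders" table is
# hypothetical). Pull moves raw rows and aggregates in the engine; push asks
# the source to aggregate so only the small result crosses the wire.
from collections import defaultdict
import sqlite3

def pull_based_totals(conn):
    """Pull raw rows out of the source, then aggregate in the engine."""
    totals = defaultdict(float)
    for customer, amount in conn.execute(
        "SELECT customer, amount_usd FROM orders"
    ):
        totals[customer] += amount
    return dict(totals)

def push_based_totals(conn):
    """Push the aggregation down to the source system itself."""
    query = "SELECT customer, SUM(amount_usd) FROM orders GROUP BY customer"
    return dict(conn.execute(query).fetchall())

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    print(pull_based_totals(conn))
    print(push_based_totals(conn))
    conn.close()
```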

##

ABOUT THE AUTHORS

Dr. Daniel Abadi, Darnell-Kanal Professor of Computer Science, University of Maryland, College Park


Professor Daniel Abadi performs research on database system architecture and implementation, especially at the intersection with scalable and distributed systems. He is best known for developing the storage and query execution engines of the C-Store column-oriented database prototype, which was commercialized by Vertica and eventually acquired by Hewlett-Packard in 2011; for his HadoopDB research on fault-tolerant, scalable analytical database systems, which was commercialized by Hadapt and acquired by Teradata in 2014; and for his work on deterministic, scalable, transactional, distributed systems such as Calvin. Abadi is an ACM Fellow and has been a recipient of a Churchill Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, a VLDB Best Paper Award, two VLDB Test of Time Awards (for the work on C-Store and HadoopDB), the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, the 2013-2014 Yale Provost's Teaching Prize, and the 2013 VLDB Early Career Researcher Award. He advised the PhD dissertations of Alexander Thomson and Jose Faleiro, both of which won the SIGMOD Jim Gray Doctoral Dissertation Award (in 2015 and 2020, respectively).

Andrew Mott, EMEA head of partner solutions architecture and Data Mesh Lead, Starburst


With more than 20 years of experience in data analytics, Andrew Mott is skilled at maximizing the value organizations get from analytics. When determining how to generate new value or fortify existing revenue through technology, Andrew considers the alignment of an organization's culture, structure, and business processes, ensuring that its strategic direction will ultimately enable it to outcompete in its market. Andrew Mott is currently EMEA head of partner solutions architecture and Data Mesh Lead at Starburst, and lives in the United Kingdom.

Published Thursday, January 11, 2024 7:38 AM by David Marshall