Virtualization Technology News and Information
Ahana 2022 Predictions: OpenFlake - The Open Data Lake for Warehouse Workloads


Industry executives and experts share their predictions for 2022. Read them in this 14th annual VMblog.com series exclusive.

OpenFlake - The Open Data Lake for Warehouse Workloads

By Dipti Borkar, Cofounder and Chief Product Officer, Ahana

As the debate over the Cloud Data Warehouse versus the Cloud Data Lake rages on, why not take the best of both? Snowflake is an incredible cloud data warehouse that helped many users move from the on-premises world to the cloud and simplified the complexity of traditional warehouses. But other challenges, such as lock-in and high costs, remain. Welcome to the Open Data Lake for Warehouse Workloads, or what I'm calling "OpenFlake". This new, modern stack will embrace the best of both worlds - the SQL analytics, security and governance, and transaction support of the data warehouse, and the openness, flexibility, and lower costs of the data lake. In 2022, all warehouses will be augmented with an open data lake for analytics, with some users skipping the warehouse altogether.

So what is this Open Data Lake Stack?

The Open Data Lake stack includes an open source query engine (like Presto) that can run all the queries that run on a data warehouse, a catalog for database metadata (like AWS Glue or Hive Metastore), a governance layer (AWS Lake Formation or Apache Ranger), a transaction manager, also called a table format, that allows updates on object stores as well as ACID support (Apache Hudi, Delta Lake, Apache Iceberg), and cloud storage (AWS S3) that can store many different open formats. The components, including compute and storage, are disaggregated, enabling much more flexibility, better performance, and greater scale. Open source technologies underpin this stack.
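To make the catalog's role in a disaggregated stack more concrete, here is a minimal sketch in Python. It is a toy, in-memory stand-in and not the API of AWS Glue or Hive Metastore; the names `TableEntry` and `ToyCatalog`, as well as the bucket name, are assumptions made up for illustration. The idea it shows is only that the catalog holds metadata (where the files are and how they are laid out), while the data stays in object storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    """Metadata the catalog keeps for one table; the data itself lives elsewhere."""
    location: str      # object-store prefix holding the table's files
    file_format: str   # open file format of those files
    columns: tuple     # (column name, type name) pairs


class ToyCatalog:
    """Hypothetical in-memory stand-in for a metastore (illustration only)."""

    def __init__(self):
        self._tables = {}

    def register(self, database, table, entry):
        self._tables[(database, table)] = entry

    def resolve(self, qualified_name):
        """Map 'database.table' to the entry describing where and how it is stored."""
        database, table = qualified_name.split(".", 1)
        return self._tables[(database, table)]


catalog = ToyCatalog()
catalog.register(
    "sales", "orders",
    TableEntry(
        location="s3://example-bucket/sales/orders/",  # made-up bucket name
        file_format="parquet",
        columns=(("id", "bigint"), ("amount", "double")),
    ),
)

# Any engine that can reach the catalog resolves the same table the same way.
entry = catalog.resolve("sales.orders")
print(entry.location, entry.file_format)
```

Because the entry carries only a location and a format, a SQL engine and a machine learning job could both look up the same table and read the same files, which is the practical meaning of separating compute from storage.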

[Figure: the Open Data Lake stack]

Digging into the layers of this stack, we start with the storage technology at the bottom. Most commonly we see an AWS S3 data lake as the place where many store their data - it's cheap, it's ubiquitous, and it's easy to use.

On top of that you add a transaction layer; the most popular we see are Apache Hudi, Delta Lake, and Apache Iceberg. This brings record-level updates and deletes, along with transactionality, to your data lake, which was not possible until now.
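How can records be updated when the files in an object store are immutable? One common idea is copy-on-write: rewrite only the affected files and then publish a new snapshot that points at them. The sketch below is a toy illustration of that idea in Python. It is not how Apache Hudi, Delta Lake, or Apache Iceberg are actually implemented, and `ToyTable`, its methods, and the file names are all hypothetical.

```python
class ToyTable:
    """Toy copy-on-write table over an immutable 'object store' (a plain dict)."""

    def __init__(self):
        self.store = {}       # object key -> tuple of rows; objects are never modified
        self.snapshots = []   # each snapshot is a tuple of object keys
        self._next_id = 0

    def _put(self, rows):
        key = f"data-{self._next_id:05d}"  # illustrative object name
        self._next_id += 1
        self.store[key] = tuple(rows)
        return key

    def _current(self):
        return self.snapshots[-1] if self.snapshots else ()

    def append(self, rows):
        self.snapshots.append(self._current() + (self._put(rows),))

    def update(self, predicate, change):
        """Rewrite only files that contain matching rows, then commit once."""
        new_files = []
        for key in self._current():
            rows = self.store[key]
            if any(predicate(r) for r in rows):
                rewritten = [{**r, **change} if predicate(r) else r for r in rows]
                new_files.append(self._put(rewritten))
            else:
                new_files.append(key)  # untouched files are reused as-is
        # Appending the snapshot is the single commit point readers observe.
        self.snapshots.append(tuple(new_files))

    def scan(self, version=-1):
        for key in self.snapshots[version]:
            yield from self.store[key]


table = ToyTable()
table.append([{"id": 1, "status": "new"}, {"id": 2, "status": "new"}])
table.append([{"id": 3, "status": "new"}])
table.update(lambda r: r["id"] == 2, {"status": "shipped"})

latest = list(table.scan())               # sees the update
previous = list(table.scan(version=-2))   # older snapshot is still readable
```

Readers always see a complete snapshot, either the one before the update or the one after it, which is the intuition behind the transactional guarantees. Deletes could follow the same pattern by dropping matching rows while rewriting a file.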

Next you have your governance layer, which allows fine-grained access to your data and lets you enforce security policies for better access control across your company. AWS Lake Formation makes it easy to set up a secure data lake very quickly, and comes pre-integrated with many of the layers in this stack. Another good option is Apache Ranger.
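As a rough picture of what fine-grained access means, the following Python sketch checks column-level read permissions per role. It is a simplified toy, not the policy model or API of AWS Lake Formation or Apache Ranger; `Grant`, `ToyPolicyStore`, and the role and table names are assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    role: str
    table: str
    columns: frozenset  # columns this role may read


class ToyPolicyStore:
    """Hypothetical column-level policy check (illustration only)."""

    def __init__(self, grants):
        self._grants = {}
        for g in grants:
            key = (g.role, g.table)
            # Multiple grants for the same role and table are merged.
            self._grants[key] = self._grants.get(key, frozenset()) | g.columns

    def allowed_columns(self, role, table):
        return self._grants.get((role, table), frozenset())

    def authorize(self, role, table, requested):
        """Return the requested columns, or raise if any of them is not granted."""
        denied = sorted(set(requested) - self.allowed_columns(role, table))
        if denied:
            raise PermissionError(
                f"role {role!r} may not read columns {denied} of {table!r}"
            )
        return list(requested)


policies = ToyPolicyStore([
    Grant("analyst", "orders", frozenset({"id", "amount"})),
    Grant("admin", "orders", frozenset({"id", "amount", "email"})),
])

ok = policies.authorize("analyst", "orders", ["id", "amount"])

try:
    policies.authorize("analyst", "orders", ["id", "email"])
    denied = False
except PermissionError:
    denied = True
```

In a real deployment the query engine would consult the governance layer before reading any data, so the same policy applies no matter which tool issues the query.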

Next is the SQL query engine. This is the heart of the stack. The query engine ties all these components together and brings true value to customers by pulling insights out of data at interactive speeds. Presto is becoming the de facto open source SQL query engine for the data lake. Services like Ahana Cloud for Presto make it very easy to deploy, run and manage Presto in AWS and come pre-integrated with any of these technologies - AWS Lake Formation or Apache Ranger, AWS Glue or Hive Metastore, any table format, and AWS S3. Plus you can bring your own BI, reporting or data science tool.

In addition to SQL, open data lakes allow you to run other workloads on top of the same data without moving it around. Machine learning tools like TensorFlow now support open formats like Apache Parquet as well. With open source, open formats, open interfaces like SQL, and the ability to run on any cloud, you can see that this stack is much more flexible than the traditional cloud data warehouse - customers can use open source technologies for many of these layers, giving them much more freedom and cost savings than a single vendor's much more expensive solution.

We're already seeing customers feel the burden of their cloud data warehouse's (CDW) rising costs. The Open Data Lake stack addresses that, and gives them more flexibility and openness that cloud data warehouses just don't provide. We'll see the stack emerge as a much more widely used architecture for analytics and AI, simplifying data infrastructure and accelerating business innovation. Not only will this give companies more flexibility and cost savings, but it will also unlock even more workloads that previously couldn't be run on the traditional cloud data warehouse, enabling data platform teams to support a wider range of data applications.

In 2022, not only will users augment their data warehouses with the innovations of the open data lake, some will skip them altogether. Welcome to the new OpenFlake world!

##

ABOUT THE AUTHOR

Dipti Borkar, Cofounder and Chief Product Officer, Ahana

Dipti is a Cofounder and CPO of Ahana with more than 15 years of experience in distributed data and database technology, including relational, NoSQL and federated systems. She is also the Presto Foundation Outreach Chairperson. Prior to Ahana, Dipti held VP roles at Alluxio, Kinetica and Couchbase. At Alluxio, she was Vice President of Products, and at Couchbase she held several leadership positions, including VP, Product Marketing, Head of Global Technical Sales and Head of Product Management. Earlier in her career Dipti managed development teams at IBM DB2 Distributed, where she started her career as a database software engineer. Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from the Haas School of Business at UC Berkeley.

Published Wednesday, January 05, 2022 7:32 AM by David Marshall