Industry executives and experts share their predictions for 2022. Read them in this 14th annual VMblog.com series exclusive.
OpenFlake - The Open Data Lake for Warehouse Workloads
By Dipti Borkar, Cofounder and Chief Product Officer, Ahana
As the debate has raged on over the Cloud Data Warehouse versus the Cloud Data Lake, why not take the best of both? Snowflake is an incredible cloud data warehouse that helped many users transition from the on-prem world to the cloud and reduced the complexity of traditional warehouses. But other challenges, like lock-in and high costs, remain. Welcome to the Open Data Lake for Warehouse Workloads, or what I'm calling
"OpenFlake". This new, modern stack will embrace the best of both worlds - SQL
analytics, security and governance, and transaction support of the data
warehouse and openness, flexibility, and lower costs of the data lake. In 2022,
all warehouses will be augmented with an open data lake for analytics with some
users skipping the warehouse all together.
So what
is this Open Data Lake Stack?
The Open Data Lake stack includes an open source query engine (like Presto) that can run all the queries that run on a data warehouse, a catalog for database metadata (like AWS Glue or Hive Metastore), a layer to support governance (AWS Lake Formation or Apache Ranger), a transaction manager, also called a table format, that allows for updates on object stores as well as ACID support (Apache Hudi, Delta Lake, Apache Iceberg), and cloud storage (AWS S3), which can store many different open formats. The various components, including compute and storage, are disaggregated, enabling much more flexibility, better performance, and greater scale. Open source technologies underpin this stack.
Digging into the layers of this stack, we start with the storage technology at the bottom. Most commonly, we see an AWS S3 data lake as the place where many are storing their data: it's cheap, it's ubiquitous, and it's easy to use.
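To make this concrete, here is a rough sketch of landing an open-format file in S3 with boto3; the bucket and key names are hypothetical placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Land an open-format Parquet file in the lake.
# Bucket and key are hypothetical placeholders.
s3.upload_file(
    Filename="daily_orders.parquet",
    Bucket="my-data-lake",
    Key="raw/orders/dt=2022-01-01/daily_orders.parquet",
)
```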
You add a transaction layer on top; the most popular we see are Apache Hudi, Delta Lake, and Apache Iceberg. This gives you the ability to bring record-level updates/deletes and transactionality to your data lake, which until now was not possible.
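As an illustration, here is a minimal sketch of a record-level delete and update on a lake table using the Delta Lake Python API with Spark; the table path, columns, and values are hypothetical, and Hudi and Iceberg expose similar operations.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Spark session configured for Delta Lake (assumes the delta-spark package is installed).
spark = (
    SparkSession.builder.appName("lake-record-updates")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Table path, columns, and values below are hypothetical placeholders.
orders = DeltaTable.forPath(spark, "s3://my-data-lake/curated/orders")

# Record-level delete and update, committed as ACID transactions on object storage.
orders.delete("order_status = 'CANCELLED'")
orders.update(condition="customer_id = 42", set={"order_status": "'ARCHIVED'"})
```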
Next, you have your governance layer, which allows for fine-grained access to your data, enabling security policies and better access control across your company. AWS Lake Formation makes it easy for you to set up a secure data lake very quickly, and it comes pre-integrated with many of the other layers in this stack. Another good option is Apache Ranger.
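For example, a fine-grained grant with Lake Formation might look like the boto3 sketch below; the account ID, role, database, and table names are hypothetical.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Grant an analyst role SELECT on a single table.
# Account ID, role, database, and table names are hypothetical placeholders.
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"
    },
    Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
    Permissions=["SELECT"],
)
```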
Next is the SQL query engine. This is the heart of the stack. The query engine is what ties all these components together and brings true value to customers by pulling insights out of the data at interactive speeds. Presto is becoming the de facto open source SQL query engine for the data lake. Services like Ahana Cloud for Presto make it very easy to run, deploy, and manage Presto in AWS and come pre-integrated with any of these technologies: AWS Lake Formation or Apache Ranger, AWS Glue or Hive Metastore, any table format, and AWS S3. Plus, you can bring your own BI, reporting, or data science tool.
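As a rough sketch, querying the lake from Presto could look like this with the presto-python-client package; the host, catalog, schema, and table are hypothetical, and the catalog would be backed by AWS Glue or a Hive Metastore.

```python
import prestodb  # pip install presto-python-client

# Connection details are hypothetical placeholders.
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",    # metadata served by AWS Glue or a Hive Metastore
    schema="default",
)

cur = conn.cursor()
cur.execute(
    "SELECT customer_id, sum(order_total) AS spend "
    "FROM orders GROUP BY customer_id ORDER BY spend DESC LIMIT 10"
)
for row in cur.fetchall():
    print(row)
```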
In addition to SQL, open data lakes allow you to run other workloads on top of the same data without moving it around. Machine learning tools like TensorFlow now support open formats like Apache Parquet as well. With open source, open formats, open interfaces like SQL, and the ability to run on any cloud, this stack is much more flexible than the traditional cloud data warehouse: customers can use open source technologies for many of these layers, giving them much more freedom and cost savings compared to a single vendor's much more expensive solution.
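As a small sketch of that idea, the same Parquet files the SQL engine queries can feed a TensorFlow training pipeline directly; the path and column names here are hypothetical, and reading s3:// paths with pandas assumes the s3fs package is installed.

```python
import pandas as pd
import tensorflow as tf

# Read the same open-format Parquet data the SQL engine queries.
# Path and column names are hypothetical; s3:// reads assume s3fs is installed.
df = pd.read_parquet("s3://my-data-lake/curated/features/part-00000.parquet")

features = df.drop(columns=["label"]).to_numpy(dtype="float32")
labels = df["label"].to_numpy(dtype="float32")

# Feed the data into a tf.data pipeline for model training.
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1024)
    .batch(32)
)
```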
We're already seeing customers feel the burden of their cloud data warehouse's (CDW) rising costs. The Open Data Lake stack addresses that and gives them the flexibility and openness that cloud data warehouses just don't provide. We'll see the stack emerge as a much more widely used architecture for analytics and AI, simplifying data infrastructure and accelerating business innovation. Not only will this give companies more flexibility and cost savings, but it will also unlock workloads that couldn't previously be run on the traditional cloud data warehouse, enabling data platform teams to support a wider range of data applications.
In 2022, not only will users augment their data warehouses; with the innovations of the open data lake, some will skip them altogether. Welcome to the new OpenFlake world!
##
ABOUT THE AUTHOR
Dipti Borkar, Cofounder and Chief Product Officer, Ahana
Dipti is a Cofounder and CPO of Ahana with over 15 years of experience in distributed data and database technology, including relational, NoSQL, and federated systems. She is also the Presto Foundation Outreach Chairperson. Prior to Ahana, Dipti held VP roles at Alluxio, Kinetica, and Couchbase. At Alluxio, she was Vice President of Products, and at Couchbase she held several leadership positions including VP, Product Marketing, Head of Global Technical Sales, and Head of Product Management. Earlier, Dipti managed development teams at IBM DB2 Distributed, where she started her career as a database software engineer. Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from the Haas School of Business at UC Berkeley.