Industry executives and experts share their predictions for 2021. Read them in this 13th annual VMblog.com series exclusive.
The Emergence of the Modern Data Architecture
By Tomer Shiran, CPO and co-founder of Dremio
In 2021, organizations will start to implement modern data architectures that both accelerate analytics and keep costs under control. Major trends will emerge that make modern cloud data lakes the center of gravity for data architectures. These trends challenge a decades-old standard that, in order to query and analyze data, data engineers need to extract and load it into a costly, proprietary data warehouse. Furthermore, the increased need for cost control, security and data governance will shift the power to centralized data teams.
Separation of Compute and Data Becomes the Standard
For years, the industry has talked about the separation of compute and storage. However, it is only with the widespread adoption of, and migration to, public clouds that it has become a reality. The separation of compute and storage provides efficiencies that were not possible in architectures that co-located compute and storage, such as on-premises data warehouses and Hadoop clusters. In the coming year, however, another paradigm for fully leveraging cloud infrastructure resources will emerge: one that puts data at the center of the architecture.
The rise of cloud data lake storage (e.g., Azure Data Lake Storage and Amazon S3) as the default bit bucket in the cloud, combined with the infinite supply and elasticity of cloud compute resources, has ushered in a new era in data analytics architectures. Just as applications have moved to microservice architectures, data itself is now able to fully exploit cloud capabilities. Data can be stored and managed in open source file and table formats, such as Apache Parquet and Apache Iceberg, and accessed by decoupled and elastic compute engines, including Apache Spark (batch), Dremio (SQL) and Apache Kafka (streaming). With these advances, data will, in essence, become its own tier, enabling us to rethink data architectures and leverage application design benefits for big data analytics.
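To make this concrete, here is a minimal sketch, in PySpark, of what treating data as its own tier can look like. The bucket name, catalog name and Iceberg runtime version are assumptions for illustration: one engine writes an Iceberg table to cloud object storage, and any other engine that understands the open table format can query the same files in place.

```python
from pyspark.sql import SparkSession

# Sketch only: the bucket, catalog name and Iceberg runtime version are assumed.
spark = (
    SparkSession.builder
    .appName("open-table-format-sketch")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-data-lake/warehouse")
    .getOrCreate()
)

# Land raw events as an Iceberg table backed by Parquet files in object storage.
events = spark.read.json("s3a://example-data-lake/raw/events/")
events.writeTo("lake.analytics.events").using("iceberg").createOrReplace()

# Any decoupled engine that speaks the table format (another Spark cluster, a SQL
# engine such as Dremio, and so on) can now query the same table in place.
spark.sql(
    "SELECT event_type, count(*) AS n FROM lake.analytics.events GROUP BY event_type"
).show()
```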
The Glow of the Cloud Data Warehouse Wears Off
Cloud data warehouse vendors have leveraged the separation of storage from compute to deliver offerings with a lower cost of entry than traditional data warehouses, as well as improved scalability. However, the data itself isn't separated from compute: it must first be loaded into the data warehouse and can only be accessed through the data warehouse. This means paying the data warehouse vendor to get the data into and out of its system. Therefore, while the upfront expense of a cloud data warehouse may be lower, the costs at the end of the year are likely to be significantly higher than expected.
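To illustrate that extra hop, here is a rough PySpark sketch of copying data that already sits in the lake into a warehouse before it can be queried there. The JDBC write is a generic stand-in for whatever load path a given warehouse uses, and the endpoint, credentials and table names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("warehouse-ingest-sketch").getOrCreate()

# Data already sits in low-cost cloud object storage...
orders = spark.read.parquet("s3a://example-data-lake/raw/orders/")

# ...but before a cloud data warehouse can serve it, it must be copied in (a billed
# ingest step), after which it is reachable only through the warehouse's own engine.
# The JDBC write below is a generic stand-in; endpoint and credentials are hypothetical.
(orders.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse.example.com:5432/analytics")
    .option("dbtable", "staging.orders")
    .option("user", "loader")
    .option("password", "change-me")
    .mode("append")
    .save())
```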
Meanwhile, low-cost cloud object storage is increasingly making the cloud data lake the center of gravity for many organizations' data architectures. While data warehouses provide a mechanism to query the data in the data lake directly, the performance isn't sufficient to meet business needs. As a result, even if they are taking advantage of low-cost cloud data lake storage, organizations still need to copy and move data into their data warehouse and incur the associated data ingest costs. By leveraging modern cloud data lake engines, open source table formats like Apache Iceberg, and data version control technologies like Project Nessie, however, companies can now query data in the data lake directly without any degradation of performance, drastically reducing complex and costly data copies and movement.
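As a sketch of what querying in place can look like with an Iceberg table and a version-controlled catalog: the Nessie endpoint, bucket and library version below are assumptions, and the same idea applies to a SQL lake engine such as Dremio reading the same table.

```python
from pyspark.sql import SparkSession

# Sketch only: the Nessie endpoint, bucket and Iceberg runtime version are assumed.
spark = (
    SparkSession.builder
    .appName("query-in-place-sketch")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.lake.uri", "http://nessie.example.com:19120/api/v1")
    .config("spark.sql.catalog.lake.ref", "main")
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-data-lake/warehouse")
    .getOrCreate()
)

# The query runs against the Parquet files in object storage directly; nothing is
# copied into a warehouse, and table history is tracked by the versioned catalog.
spark.sql("""
    SELECT customer_id, sum(amount) AS total_spend
    FROM lake.analytics.orders
    WHERE order_date >= date '2021-01-01'
    GROUP BY customer_id
    ORDER BY total_spend DESC
    LIMIT 10
""").show()
```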
Data Governance Shifts the Power Back to Centralized Data Teams
The increasing demand for data and analytics means that data teams are struggling to keep up with never-ending requests from analysts and data scientists. As a result, data is often extracted and shared without IT's supervision or control. At the same time, macro-economic conditions combined with new privacy laws and breach concerns will shift power back to centralized data teams. These teams will invest in building enterprise-wide data platforms such as cloud data lakes, allowing them to drastically reduce overall cloud costs by eliminating data copies and the need for expensive data warehouses. The ability to modify datasets and delete records directly within data lakes will make it easier to handle the right to be forgotten, while open source data version control technologies will enable centralized data governance by eliminating silos and promoting data integrity.
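As a minimal sketch of how these governance capabilities might be exercised (table, column and user values are hypothetical, and Iceberg's Spark SQL extensions are assumed for row-level deletes), a central data team could handle a deletion request and audit the change directly in the lake:

```python
from pyspark.sql import SparkSession

# Sketch only: table, column and user values are hypothetical; Iceberg's SQL
# extensions are enabled to allow row-level DELETE against the lake table.
spark = (
    SparkSession.builder
    .appName("governance-sketch")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-data-lake/warehouse")
    .getOrCreate()
)

# Handle a "right to be forgotten" request by deleting the user's records directly
# in the data lake table, with no warehouse copies to chase down.
spark.sql("DELETE FROM lake.analytics.events WHERE user_id = 'user-123'")

# The table's snapshot history records when and how the data changed, which a
# centralized data team can review as part of its governance process.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM lake.analytics.events.snapshots"
).show()
```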
About the Author
Tomer Shiran is the CPO and co-founder of Dremio. Prior to Dremio, he was VP of Product at MapR, where he was responsible for product strategy, roadmap and new feature development. As a member of the executive team, Tomer helped grow the company from five employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He holds a master's degree in electrical and computer engineering from Carnegie Mellon University and a bachelor's degree in computer science from Technion - Israel Institute of Technology, as well as five U.S. patents.