Dremio 2021 Predictions: The Emergence of the Modern Data Architecture

Industry executives and experts share their predictions for 2021. Read them in this 13th annual series exclusive.

The Emergence of the Modern Data Architecture

By Tomer Shiran, CPO and co-founder of Dremio

In 2021, organizations will start to implement modern data architectures that both accelerate analytics and keep costs under control. Major trends will emerge that make modern cloud data lakes the center of gravity for data architectures. These trends challenge the decades-old assumption that data engineers must extract data and load it into a costly, proprietary data warehouse before it can be queried and analyzed. Furthermore, the increased need for cost control, security and data governance will shift power to centralized data teams.

Separation of Compute and Data Becomes the Standard

For years, the industry has talked about the separation of compute and storage. However, it is only with the widespread adoption of, and migration to, public clouds that it has become a reality. The separation of compute and storage provides efficiencies that were not possible in architectures that co-located the two, such as on-premises data warehouses and Hadoop clusters. In the coming year, however, another paradigm for fully leveraging cloud infrastructure resources will emerge - one that puts data at the center of the architecture.

The rise of cloud data lake storage (e.g., Azure Data Lake Storage and Amazon S3) as the default bit bucket in the cloud, combined with the virtually unlimited supply and elasticity of cloud compute resources, has ushered in a new era in data analytics architectures. Just as applications have moved to microservice architectures, data itself is now able to fully exploit cloud capabilities. Data can be stored and managed in open source file and table formats, such as Apache Parquet and Apache Iceberg, and accessed by decoupled, elastic compute engines including Apache Spark (batch), Dremio (SQL) and Apache Kafka (streaming). With these advances, data will in essence become its own tier, enabling us to rethink data architectures and bring the benefits of modern application design to big data analytics.

The Glow of the Cloud Data Warehouse Wears Off

Cloud data warehouse vendors have leveraged the separation of storage from compute to deliver offerings with a lower cost of entry than traditional data warehouses, as well as improved scalability. However, the data itself isn't separated from compute - it must first be loaded into the data warehouse and can only be accessed through the data warehouse. This means paying the data warehouse vendor to get data into and out of its system. Therefore, while the upfront expenses for a cloud data warehouse may be lower, the costs at the end of the year are likely to be significantly higher than expected.

Meanwhile, low-cost cloud object storage is increasingly making the cloud data lake the center of gravity for many organizations' data architectures. While data warehouses provide a mechanism to query the data in the data lake directly, the performance isn't sufficient to meet business needs. As a result, even if they are taking advantage of low-cost cloud data lake storage, organizations still need to copy and move data into their data warehouse and incur the associated data ingest costs. By leveraging modern cloud data lake engines and open source technologies such as the Apache Iceberg table format and the Project Nessie data version control system, however, companies can now query data in the data lake directly without any degradation of performance, resulting in a sharp reduction in complex and costly data copying and movement.

Data Governance Shifts the Power Back to Centralized Data Teams

The increasing demand for data and analytics means that data teams are struggling to keep up with never-ending requests from analysts and data scientists. As a result, data is often extracted and shared without IT's supervision or control. At the same time, macroeconomic conditions combined with new privacy laws and breach concerns will shift power back to centralized data teams. These teams will invest in building enterprise-wide data platforms such as cloud data lakes, allowing them to drastically reduce overall cloud costs by eliminating data copies and the need for expensive data warehouses. The ability to modify datasets and delete records directly within data lakes will make it easier to honor right-to-be-forgotten requests, while open source data version control technologies will enable centralized data governance by eliminating silos and promoting data integrity.


About the Author

Tomer Shiran 

Tomer Shiran is the CPO and co-founder of Dremio. Prior to Dremio, he was VP of Product at MapR, where he was responsible for product strategy, roadmap and new feature development. As a member of the executive team, Tomer helped grow the company from five employees to over 300 employees and 700 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. He holds a master's degree in electrical and computer engineering from Carnegie Mellon University and a bachelor's degree in computer science from Technion - Israel Institute of Technology, as well as five U.S. patents.

Published Tuesday, December 15, 2020 7:25 AM by David Marshall