Open Beats Closed: How Open Source Is Changing Data Management

By Sida Shen, Product Marketing Manager, CelerData

Recently I ran across some useful advice from the entrepreneur and tech investor Mark Walsh, who was part of the first big tech boom in the 1990s.

Walsh had a simple mantra that he frequently invoked when analyzing any market or business strategy: "Open beats closed, easy beats hard and cheap beats expensive." Over more than 30 years working in tech, Walsh came to believe that these three simple rules held true for all technology companies, and that they gave him great clarity in judging the potential success or failure of startup ideas and company business plans.

You can see these rules at work in the rise and fall of many major tech companies since the internet era began in the 1990s, from AOL to BlackBerry to Yahoo! and even Blockbuster Video. Usually, when we see a sudden shift in a business model or a rapid changing of the guard, some element of Walsh's mantra is at play - from over-investing in a closed system or making things too complex for customers, to more positive moves like breaking out of proprietary barriers or greatly reducing the cost of a service. In essence, Walsh's rules highlight how technological democratization tends to win out over time, to the benefit of the end user.

Fast forward to today's world of data storage, analysis and management. For a few years now, what used to be an either/or choice between data lakes and data warehouses has morphed into a third road forward: the data lakehouse. By combining the best qualities of both - the cheap, flexible storage of raw and unstructured data that lakes provide, and the fast, structured querying of a warehouse - lakehouses have led to real gains in actionable insights and analysis, and many providers have restructured their offerings around this new model.
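To make the idea concrete, here's a minimal sketch - not any vendor's product, just an illustration - using the open source DuckDB engine to run warehouse-style SQL directly over raw Parquet files sitting in lake storage. The glob path is a hypothetical example.

```python
import duckdb  # open source, in-process analytics engine

# Warehouse-style SQL directly over raw files in lake storage:
# no separate load step and no second copy of the data.
# The file path below is hypothetical.
duckdb.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM read_parquet('data/events/*.parquet')
    GROUP BY region
    ORDER BY total_sales DESC
""").show()
```

One engine, one copy of the data: that, in miniature, is the gain the lakehouse model promises.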

However, while the data lakehouse concept has been around for at least a few years, and while quite a few companies have placed strong bets on it (analyst firm Forrester notes at least 13 major vendors in the space), there's a new direction taking hold in the lakehouse world, and that's open source.

Open source data lakehouses allow for seamless interaction between a much wider range of data processing and analytics tools, paving the way for increased scalability, flexibility and adaptability. And they have some great advantages over traditional lakehouses in ways that continue to prove Walsh right.

First is the overall approach to data architecture. There's a vast difference between building - and iterating - for customers on a licensing basis and building for a community. That's not always a bad thing, but because many providers are trying to meet customer needs while also reinforcing their existing solutions, they can end up with fewer options and ideas - it's never quite blue sky. Building with the mindset of adding on to existing products can produce a more restrictive data infrastructure and architecture. It's like adding on to an existing building versus starting from scratch.

By contrast, open source lakehouse architecture lets organizations integrate, right from the start, the tools that best fit their use cases, and construct their processing and integration accordingly. This openness also means you're not locked into a particular vendor or arrangement. Building for performance often requires flexibility, and the interoperability that open source provides means you can integrate a very diverse set of tools and systems, choosing from a much wider palette of options without worrying about compatibility issues. And as your needs change, you can even switch from commercial providers to open source and back again. Open beats closed, as Walsh might say.
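As a concrete illustration of that interoperability, here's a minimal sketch using Spark with the open Apache Iceberg table format. It assumes the iceberg-spark-runtime package is available to Spark; the catalog name, namespace and warehouse path are hypothetical.

```python
from pyspark.sql import SparkSession

# Hypothetical catalog name ("demo") and local warehouse path.
spark = (
    SparkSession.builder
    .appName("open-table-format-sketch")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "file:///tmp/warehouse")
    .getOrCreate()
)

# The table lives in an open format, not a proprietary one.
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.orders (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO demo.db.orders VALUES (1, 19.99), (2, 5.25)")

# Any Iceberg-aware engine - Trino, Flink, StarRocks and others - can
# now query demo.db.orders in place, with no export or migration step.
spark.sql("SELECT * FROM demo.db.orders").show()
```

Because the table format itself is open, swapping query engines later is a configuration change, not a data migration.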

Innovation is another key area where open source gives lakehouses an advantage. With a proprietary system, engineers and users have to wait for product updates or patches. Because these rely on a limited number of engineers, they are invariably on a slower cycle than an open source system, where a community of thinkers and doers is constantly at work making updates and upgrades. New technologies and features are introduced more quickly, and a broader range of people tests them and gives immediate feedback. This rapid virtuous cycle often means new approaches get field-tested much more quickly - and more thoroughly - than if they were part of a proprietary upgrade sales plan. And because all of this happens at open source's far more cost-efficient price point, use, testing and development are encouraged even further. An example of cheap beating expensive.

Finally, transparency is an underrated yet crucial part of the conversation. One of the real benefits of an open source approach is just that - an accessible codebase and operational procedures that provide increased visibility for all parties. This not only helps increase trust in the overall system and process, but also provides a real opportunity to improve the overall security of the lakehouse.

Open systems strengthen data governance and security by offering a clear view of data handling processes, which gives the community the opportunity to fully examine code and data pipelines for vulnerabilities. It also allows developers to build solutions in alignment with security standards. Extra transparency helps untangle the confusing web of connected databases, storage and management which, when seen in the bright light of day, can reveal areas for improving efficiency and even for adding new services and functions. Simplicity over complexity, as it were.
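One concrete expression of this transparency: open table formats record their own change history as ordinary, queryable metadata. Here's a short sketch, continuing the hypothetical Iceberg table from the earlier example, of auditing when and how a table changed.

```python
from pyspark.sql import SparkSession

# Continues the earlier sketch: same hypothetical catalog and table.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "file:///tmp/warehouse")
    .getOrCreate()
)

# Every commit to the table is a snapshot that anyone with access can
# inspect through Iceberg's built-in metadata tables - plain SQL, no
# proprietary audit tooling required.
spark.sql(
    "SELECT snapshot_id, committed_at, operation FROM demo.db.orders.snapshots"
).show()
spark.sql(
    "SELECT made_current_at, snapshot_id, is_current_ancestor FROM demo.db.orders.history"
).show()
```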

The concept of the data lakehouse has come a long way in the past few years. We've gone from assuming we'd always need separate copies of our data for storage and for analysis to working with both harmoniously, with less duplication and denormalization. We also expected to sacrifice speed for thoroughness, or to limit our datasets, and now we see that is no longer the case either. This model of managing and learning from our data is growing and evolving right before our eyes.

And that growth will continue so long as we follow the outlines of Walsh's advice. Data lakehouses are still a new area, and they will need the continual feedback, innovation and growth that the open source community can provide if they are to keep developing and become the norm for how we manage and analyze our data. This means we need to always remember: Open beats closed.



Sida Shen 

Sida Shen is a product marketing manager at CelerData. He's spent years in the analytics and data engineering space and is passionate about helping developers simplify their real-time analytics workloads.

Published Wednesday, June 12, 2024 7:30 AM by David Marshall