By Sida Shen, Product Marketing Manager, CelerData
Recently I ran across some useful advice from the entrepreneur and
tech investor Mark
Walsh, who was part of the first big tech boom in the 1990s.
Walsh had a simple mantra that he frequently invoked during his
career when analyzing any market or business strategy: "Open beats closed,
easy beats hard and cheap beats expensive." Over his more than 30 years
working in tech, Walsh came to believe that these three simple rules held true
for all technology companies, and that they gave him great clarity when judging
the potential success or failure of startup ideas and company business plans.
You can trace these rules through the rise and fall of many major tech
companies since the beginning of the internet era in the 1990s, from AOL to
BlackBerry to Yahoo! and even Blockbuster Video. Usually, when we see a sudden
shift in a business model or a rapid changing of the guard, there's some
element of Walsh's mantra at play - from over-investing in a closed system or
making things too complex for customers, to more positive actions like breaking
out of proprietary barriers or greatly reducing the cost of a service. In
essence, Walsh's rules highlight how technological democratization tends to win
out over time, to the benefit of the end-user.
Fast forward to today's world of data storage, analysis and
management. For a few years now, what used to be an either/or choice between
data lakes and data warehouses has instead morphed into a third road forward:
the data lakehouse. Combining the best qualities of both - the flexible,
low-cost storage of raw and unstructured data that lakes offer and the fast,
structured analytics of a warehouse - the lakehouse has delivered real gains in
actionable insights and analysis as its use has spread, and many providers have
restructured their offerings around this new model.
However, while the data lakehouse concept has been around for at least a
few years, and while quite a few companies have placed strong bets on it
(analyst firm Forrester notes at least 13 major vendors in the space), there's
a new direction taking hold in the lakehouse world, and that's
open source.
Open source data lakehouses allow for seamless interaction among a much
wider range of data processing and analytics tools, paving the way for
increased scalability, flexibility and adaptability. And they have some great
advantages over traditional lakehouses in ways that continue to prove Walsh
right.
First is the overall approach to data architecture. There's a vast
difference in how you build, and even iterate, when you're creating on a
customer-licensing basis versus creating for a community. That's not always a
bad thing, but because many providers are trying to simultaneously meet
customer needs and reinforce their existing solutions, they can end up with
fewer options and ideas - it's not quite blue sky. Building with the mindset of
adding on to existing products can lead to a more restrictive data
infrastructure and architecture. It's like adding on to an existing building
versus starting from scratch.
By contrast, open source lakehouse architecture allows organizations to
integrate and use the tools best suited to their use cases right from the
start, and to construct their processing and integration accordingly. This
openness also means you are not locked into a particular vendor or arrangement.
Building for performance often requires flexibility, and the interoperability
that open source provides means you can integrate a very diverse set of tools
and systems and choose from a much wider palette of options without having to
worry about compatibility issues. And as your needs change, you can even switch
from commercial providers to open source and back again. Open beats closed, as
Walsh might say.
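To make that interoperability concrete, here is a minimal sketch (my own illustration, not from Walsh or any particular vendor) of one common open lakehouse pattern: an Apache Iceberg table sitting in object storage that more than one engine can query without the data being copied. The catalog name, table name and S3 path are hypothetical placeholders, and the snippet assumes the Iceberg Spark runtime is on the classpath.

```python
from pyspark.sql import SparkSession

# Engine 1: Apache Spark, configured with an Iceberg catalog backed by
# object storage (the bucket and warehouse path are hypothetical).
spark = (
    SparkSession.builder
    .appName("open-lakehouse-interop-sketch")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Query the (hypothetical) Iceberg table exactly as you would any SQL table.
spark.sql(
    "SELECT status, COUNT(*) AS orders FROM demo.sales.orders GROUP BY status"
).show()

# A second engine doesn't appear here because it doesn't need to: any other
# Iceberg-aware engine (Trino, Flink, StarRocks, DuckDB and so on) can point at
# the same warehouse path or catalog and read the identical table, so adding or
# swapping engines never means migrating the data itself.
```

Because the table format and its metadata are open, the choice of query engine becomes a detail you can revisit later rather than a one-way door - which is the practical meaning of not being locked in.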
Innovation is another key area where open source provides advantages for
lakehouse architectures. With a proprietary system, engineers and users have to
wait for product updates or patches. Because these rely on a limited number of
engineers, they are invariably on a slower cycle than an open source system,
where a community of thinkers and doers is constantly at work in the background
making updates and upgrades. This leads to new technologies and features being
introduced more quickly, and to a broader range of people testing them and
giving immediate feedback. This rapid, virtuous cycle often means new
approaches get field-tested much more quickly - and thoroughly - than if they
were part of a proprietary upgrade sales plan. And having all of this available
at the much more cost-efficient open source price point further encourages use,
testing and development - an example of cheap beating expensive.
Finally, transparency is an underrated yet crucial part of the
conversation. One of the real benefits of an open source approach is just
that - an accessible codebase and operational procedures that provide increased
visibility for all parties. This not only helps increase trust in the overall
system and process, but also provides a real opportunity to improve the overall
security of the lakehouse.
Open systems strengthen data governance and security by offering a clear
view of data handling processes, which gives the community the opportunity to
fully examine code and data pipelines for vulnerabilities. It also allows
developers to build solutions that align with security standards. Extra
transparency helps untangle the confusing web of connected databases, storage
and management which, when seen in the bright light of day, can reveal areas
for improving efficiency and even for adding new services and functions.
Simplicity over complexity, as it were.
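One small, hypothetical illustration of that transparency, building on the earlier sketch: open table formats keep their own change history as queryable metadata, so governance questions like "what changed, and when?" can be answered directly from the table. Apache Iceberg, for instance, exposes metadata relations such as history and snapshots; the snippet below reuses the Spark session and the placeholder demo.sales.orders table from before.

```python
# "spark" is the SparkSession configured with the Iceberg catalog in the
# earlier sketch; "demo.sales.orders" is the same hypothetical table.

# Every snapshot the table has ever made current, with timestamps and lineage.
spark.sql(
    "SELECT made_current_at, snapshot_id, is_current_ancestor "
    "FROM demo.sales.orders.history"
).show()

# Operation-level detail (append, overwrite, delete, ...) for each snapshot.
spark.sql(
    "SELECT committed_at, operation, summary "
    "FROM demo.sales.orders.snapshots"
).show()
```

Nothing about that audit trail sits behind a vendor console; anyone with read access to the table can inspect it.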
The concept of the data lakehouse has come a long way in the past
few years. We've gone from thinking we'd always have to keep separate
systems - and separate copies of our data - for storage and analysis to being
able to work with both harmoniously, with less duplication and denormalization.
We also expected to sacrifice speed for thoroughness, or to limit our datasets,
and now we see that is no longer the case either. This model of managing and
learning from our data is growing and evolving right before our eyes.
And that growth will continue so long as we follow the outlines of
Walsh's advice. Data lakehouses are a new area, and they will need the
continual feedback, innovation and growth that the open source community can
provide if they are to continue to develop and become the norm for how we
handle our data analyses. This means we need to always remember: Open beats
closed.
##
ABOUT THE AUTHOR
Sida Shen is a
product manager for CelerData. He's spent years in the analytics and data
engineering space and is passionate about helping developers simplify their
real-time analytics workloads.