Industry executives and experts share their predictions for 2024. Read them in this 16th annual VMblog.com series exclusive.
In 2024, sharding pain will accelerate shift to distributed SQL
By Sunny Bains, Software Architect, PingCAP
When scaling IT systems, "divide and conquer"
is the tried and true approach. In the world of databases, that means sharding
or partitioning data.
Sharding has been standard practice for as long as databases have existed. But as data volumes soar into the terabytes, it's becoming increasingly unsustainable to shard and reshard manually. In fact, the pain of manually sharding and resharding large data sets could be one of the most important factors driving migrations in 2024 to database systems that can scale both vertically and horizontally, scaling out and in automatically.
The idea behind sharding is simple enough. As
a database grows, it requires more storage and compute. Sharding divides the
database into discrete segments ("shards") of data and dedicates discrete
resources (e.g. servers, storage) to each shard. Typically this process is done
manually. IT staffers decide where each shard will live, procure the necessary
hardware and transfer the shards to the new topology.
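To make the mechanics concrete, here is a minimal Python sketch of hash-based shard routing, assuming a toy cluster of in-memory dictionaries; the shard count, key format, and routing function are illustrative, not how any particular database does it.

```python
# Minimal, illustrative sketch of hash-based shard routing.
# The shard count and key scheme are hypothetical, chosen only to show
# why resharding hurts: changing NUM_SHARDS remaps most keys.

import hashlib

NUM_SHARDS = 4  # each shard gets its own server/storage in this toy model

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a row to a shard by hashing its key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# A toy "cluster": one dict per shard standing in for a real database node.
cluster = [dict() for _ in range(NUM_SHARDS)]

def insert(key: str, row: dict) -> None:
    cluster[shard_for(key)][key] = row

def lookup(key: str):
    return cluster[shard_for(key)].get(key)

if __name__ == "__main__":
    insert("user:1001", {"name": "Ada"})
    print(lookup("user:1001"))
    # Doubling NUM_SHARDS would change shard_for() for most keys,
    # which is exactly the data movement a manual resharding project entails.
```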
You can probably see the issue. The larger the
database, the more complicated the sharding project. Resharding a small
database is manageable. Resharding a 10TB or larger table is a nightmare. This
is a moment when all sorts of Internet-of-Things devices are coming online and OLTP systems are running at staggering scale, growing every year. Databases
at midsize enterprises now commonly reach terabyte scale. In many of those
cases, manual sharding just doesn't make sense anymore. The cost to store all
this data is also growing. According to anecdotal reports, only 15% of data is
"hot," meaning required for online operations. The implication is that "cold"
data can be stored on a cheaper medium. This has created a requirement for
disaggregated storage where hot data is stored on faster but more expensive
media, and cold data on slower but cheaper storage. Users expect such storage
disaggregation to be done automatically. Sharding the data automatically makes
this much easier to implement.
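As a rough illustration, here is a hypothetical Python sketch of that hot/cold placement policy; the one-week hot window, the promotion-on-access rule, and the tier names are assumptions made for the example, not the behavior of any specific product.

```python
# Hypothetical sketch of a hot/cold placement policy for disaggregated storage.
# Thresholds, tier names, and the recency heuristic are illustrative assumptions.

from dataclasses import dataclass, field
from time import time

HOT_WINDOW_SECONDS = 7 * 24 * 3600  # assumed: "hot" = touched within the last week

@dataclass
class Placement:
    hot: dict = field(default_factory=dict)     # fast, expensive media (e.g. local NVMe)
    cold: dict = field(default_factory=dict)    # slow, cheap media (e.g. object storage)
    last_access: dict = field(default_factory=dict)

    def put(self, key: str, value: bytes) -> None:
        self.hot[key] = value                    # new writes land on the hot tier
        self.last_access[key] = time()

    def get(self, key: str) -> bytes:
        self.last_access[key] = time()
        if key in self.cold:                     # promote cold data on access
            self.hot[key] = self.cold.pop(key)
        return self.hot[key]

    def demote_cold_data(self, now: float = None) -> None:
        """Move anything not touched within the hot window to cheaper storage."""
        now = time() if now is None else now
        for key in list(self.hot):
            if now - self.last_access[key] > HOT_WINDOW_SECONDS:
                self.cold[key] = self.hot.pop(key)
```

A real system would track this per region or partition rather than per key, but the shape of the policy is the same: record access recency and migrate data between tiers automatically.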
So why do so many businesses keep sharding the
old-fashioned way? It's the boiling frog phenomenon. Few companies achieve
massive scale overnight. Even fewer design for scale when they're small.
Sharding is manageable when you're just starting out. And once you get big, as
long as budgets are flush, you can keep scaling by throwing personnel at the
problem. The trouble is, good times don't last forever. Headcounts eventually
get cut. And the database keeps growing. That's when the pain of sharding
really starts to bite.
Some database systems actually require sharding whether the application
needs it or not. The user needs to know what shard the data lives in, and if a
transaction fails across multiple shards, it's on the developer to roll back
the transaction on the shards on which the commit failed. If the business value
is there - great. Shard away. But in all too many cases, businesses are laboring to maintain larger and larger databases that need to be sharded and resharded, and the value simply isn't enough to justify the inconvenience, cost, and risk. And a lot of them have had it.
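To see what that burden looks like in application code, here is a deliberately simplified Python sketch of a cross-shard write coordinated by the application; the Shard class and its methods are hypothetical stand-ins rather than any real driver API.

```python
# Deliberately simplified sketch of application-managed, cross-shard writes.
# The Shard class is a hypothetical stand-in; the point is that the application,
# not the database, must route, commit, and clean up.

class Shard:
    """Toy stand-in for a connection to one shard's database server."""
    def __init__(self, name: str):
        self.name = name
        self.committed = {}
        self.pending = {}

    def write(self, key: str, row: dict) -> None:
        self.pending[key] = row

    def commit(self) -> None:
        self.committed.update(self.pending)
        self.pending.clear()

    def rollback(self) -> None:
        self.pending.clear()

def transfer(shards: list, debit_key: str, credit_key: str, rows: dict) -> bool:
    """A 'transaction' spanning two shards, coordinated entirely by the app."""
    touched = []
    try:
        for key in (debit_key, credit_key):
            shard = shards[hash(key) % len(shards)]   # app must know the routing rule
            shard.write(key, rows[key])
            touched.append(shard)
        for shard in touched:
            shard.commit()   # no atomicity across shards: if the second commit fails
                             # after the first succeeds, undoing it is on the developer
        return True
    except Exception:
        for shard in touched:
            shard.rollback()
        return False
```

Every bit of that routing, commit, and cleanup logic is work the database could be doing instead of the developer.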
Tighter IT budgets in 2024 will only make the
problem more acute. The need to cut costs will accelerate the shift toward
technologies like distributed SQL databases, which eliminate the
need for explicit sharding and provide multi-tenancy capabilities. These
capabilities will help organizations consolidate their database instances and reduce
cost and complexity.
To be clear, sharding a database is not a bad
thing in and of itself. Manually
sharding and resharding it is, especially when databases grow as large as they
often do today. As with so many aspects of IT, the complexity and scale of the
average database is beginning to outstrip human capacity. The only practical
solution is automation.
Today's auto-sharding databases are already
vastly more reliable - and smarter about distributing data - than they were 10
years ago, and AI will make them even more so. If you want to know where AI
will make the biggest impact on databases, don't look just at chatbots. Look at
the complexity of sharding and data placement rules. This goes beyond mere
scaling. By predicting data access patterns, we will be able to prefetch data closer to where it's required, and we can even use similar predictions to shard the data more intelligently - for example, by optimizing for disaggregated storage and access patterns.
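As a toy illustration of that idea, the Python sketch below predicts "hot" keys from recent access frequency and flags them for prefetching; the sliding window, threshold, and frequency heuristic are assumptions for the example - a real system would use far richer models.

```python
# Toy sketch of access-pattern prediction driving prefetch decisions.
# The window size, threshold, and counting heuristic are illustrative assumptions.

from collections import Counter, deque

class AccessPredictor:
    def __init__(self, window: int = 1000, hot_threshold: int = 5):
        self.window = deque(maxlen=window)   # recent key accesses
        self.counts = Counter()
        self.hot_threshold = hot_threshold

    def record(self, key: str) -> None:
        if len(self.window) == self.window.maxlen:
            self.counts[self.window[0]] -= 1  # forget the access about to slide out
        self.window.append(key)
        self.counts[key] += 1

    def should_prefetch(self, key: str) -> bool:
        """Predict 'hot' keys worth keeping on (or moving to) fast storage."""
        return self.counts[key] >= self.hot_threshold

predictor = AccessPredictor()
for _ in range(6):
    predictor.record("orders:2024-01")
print(predictor.should_prefetch("orders:2024-01"))  # True: frequently accessed
print(predictor.should_prefetch("orders:2019-07"))  # False: cold, leave on cheap storage
```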
It's often said that every company is a
technology company, or even a software company. It's certainly true that
more and more companies are struggling to scale their data operations. The old
ways of doing things aren't working for them anymore, and sharding by hand is a
prime example. It won't disappear right away, but in the not too distant
future, the idea of manual sharding might seem as exotic as coding in assembly
language - not extinct, but by and large replaced by systems that manage the
process automatically.
##
ABOUT THE AUTHOR
Sunny Bains is a software architect at PingCAP, the company behind TiDB. He has worked on storage engines for more than 22 years. His first acquaintance with database kernel work was in 2001, when he was tasked with writing a database engine from scratch. In 2006, he joined the Oracle InnoDB team and was the team lead from 2013. While there, he fixed all the core InnoDB bottlenecks and made it scale to what it is today. Some of his other contributions include parallel scanning of the B-tree, parallel DDL, and the full-text search core.