Virtualization Technology News and Information
Article
RSS
PingCAP 2024 Predictions: In 2024, sharding pain will accelerate shift to distributed SQL

vmblog-predictions-2024 

Industry executives and experts share their predictions for 2024.  Read them in this 16th annual VMblog.com series exclusive.

In 2024, sharding pain will accelerate shift to distributed SQL

By Sunny Bains, Software Architect, PingCAP

When scaling IT systems, "divide and conquer" is the tried and true approach. In the world of databases, that means sharding or partitioning data.

Sharding has been standard practice as long as databases have existed. But as new data-generating data volumes soar into the terabytes, it's becoming increasingly unsustainable to shard and reshard manually. In fact, the pain of manually sharding and resharding large data sets could be one of the most important factors driving database migrations to database systems that can scale both vertically and horizontally by scaling out and scaling in automatically in 2024.

The idea behind sharding is simple enough. As a database grows, it requires more storage and compute. Sharding divides the database into discrete segments ("shards") of data and dedicates discrete resources (e.g. servers, storage) to each shard. Typically this process is done manually. IT staffers decide where each shard will live, procure the necessary hardware and transfer the shards to the new topology. 

You can probably see the issue. The larger the database, the more complicated the sharding project. Resharding a small database is manageable. Resharding a 10TB or larger table is a nightmare. This is a moment when all sorts of Internet-of-Things devices are coming online, OLTP systems are running at staggering scale and growing every year. Databases at midsize enterprises now commonly reach terabyte scale. In many of those cases, manual sharding just doesn't make sense anymore. The cost to store all this data is also growing. According to anecdotal reports only 15% of data is "hot," meaning required for online operations. The implication is that "cold" data can be stored on a cheaper medium. This has created a requirement for disaggregated storage where hot data is stored on faster but more expensive media, and cold data on slower but cheaper storage. Users expect such storage disaggregation to be done automatically. Sharding the data automatically makes this much easier to implement.

So why do so many businesses keep sharding the old-fashioned way? It's the boiling frog phenomenon. Few companies achieve massive scale overnight. Even fewer design for scale when they're small. Sharding is manageable when you're just starting out. And once you get big, as long as budgets are flush, you can keep scaling by throwing personnel at the problem. The trouble is good times don't last forever. Headcounts eventually get cut. And the database keeps growing. That's when the pain of sharding really starts to bite.

Some database systems actually require sharding whether the application needs it or not. The user needs to know what shard the data lives in, and if a transaction fails across multiple shards, it's on the developer to roll back the transaction on the shards on which the commit failed. If the business value is there - great. Shard away. But in all too many cases, businesses are laboring to maintain larger and larger databases that need to be sharded and resharded - certainly not enough to justify the inconvenience, cost, expense and risk. And a lot of them have had it.

Tighter IT budgets in 2024 will only make the problem more acute. The need to cut costs will accelerate the shift toward technologies like distributed SQL databases, which eliminate the need for explicit sharding and provide multi-tenancy capabilities. These capabilities will help them consolidate their database instances and reduce cost and complexity.

To be clear, sharding a database is not a bad thing in and of itself. Manually sharding and resharding it is, especially when databases grow as large as they often do today. As with so many aspects of IT, the complexity and scale of the average database is beginning to outstrip human capacity. The only practical solution is automation.

Today's auto-sharding databases are already vastly more reliable - and smarter about distributing data - than they were 10 years ago, and AI will make them even more so. If you want to know where AI will make the biggest impact on databases, don't look just at chatbots. Look at the complexity of sharding and data placement rules. This goes beyond mere scaling. By predicting data access patterns we will be able to prefetch data closer to where it's required, we can even use similar predictions to more intelligently shard the data - for example, by optimizing for disaggregated storage and access patterns.

It's often said that every company is a technology company, or even a software company. It's certainly true that more and more companies are struggling to scale their data operations. The old ways of doing things aren't working for them anymore, and sharding by hand is a prime example. It won't disappear right away, but in the not too distant future, the idea of manual sharding might seem as exotic as coding in assembly language - not extinct, but replaced by and large with systems that manage the process automatically.

##

ABOUT THE AUTHOR

Sunny Bains 

Sunny Bains is a software architect at PingCAP, the company behind TiDB. He has worked on storage engines for more than 22 years. His first acquaintance with database kernel work was in 2001, when he was tasked with writing a database engine from scratch. In 2006, he joined the Oracle InnoDB team and was the team lead from 2013. While there, he fixed all the core InnoDB bottlenecks and made it scale to what it is today. Some of his other contributions include parallel scanning of the B-tree, parallel DDL, and the full-text search core.

Published Thursday, January 11, 2024 7:40 AM by David Marshall
Comments
There are no comments for this post.
To post a comment, you must be a registered user. Registration is free and easy! Sign up now!
Calendar
<January 2024>
SuMoTuWeThFrSa
31123456
78910111213
14151617181920
21222324252627
28293031123
45678910