The problem with sharding
New SaaS startups rarely think about how to scale their applications. Sure, they envision the day they will need to expand, and they build growth into their financial plans, but they rarely architect their applications for scalability early on. Rather, they more often focus on completing features that they can sell.
But the time to think about scaling is at the very beginning-before your first customer signs up for your service. As the company rolls out feature after feature, and customers continue to sign up, the business grows. As the business grows, scaling becomes a point of concern.
The need to scale often becomes evident when a new SaaS service hits a resource capacity limit, specifically data access resource capacity. Often the database, whatever the technology, is too small to meet growing demands and can’t be expanded past a certain point.
This problem can occur no matter what database technology you use, and no matter what size of server or other infrastructure you put in place to give yourself the room to grow. Sooner or later, you’ll hit scaling issues.
Once the scaling resource wall becomes apparent, and serious scaling decisions need to be made, data sharding-dividing your data across multiple parallel databases with each database holding a segment of your business-is often one of the early solutions that is introduced to expand the ability to scale your application. There are many reasons for this:
- Dividing your data into multiple segments is seemingly a simple solution to solving data resource issues. If one database is too small to handle your traffic, let’s try two, or three, or four!
- Once you’ve sharded your application data, it’s seemingly very simple to continue scaling using the same approach-as your traffic grows, simply add more parallel databases to your application.
Let’s take a closer look at sharding and how it is used to solve early database scaling issues.
A simple sharding example
What, exactly, is sharding? A typical SaaS use case involves customers talking to some application, which then makes use of data that’s stored in a database. As the number of customers increases, the load on the application increases. Typically, it’s relatively easy to increase the capacity of the application by adding more servers to handle the load. (This of course has limits, and is worthy of a separate discussion at another time.)
However, once you reach a certain number of customers, your scaling limitation suddenly becomes your database. Your database can’t effectively handle additional customers, and your application will end up with availability issues, performance issues, and other problems. This is illustrated in Figure 1.
Once your database reaches a certain size and capacity, it’s difficult (if not impossible) to make it grow any larger. Instead, you might choose to split the database into multiple, parallel databases and divide the customer base between the different databases.
In Figure 2, we split the customers across two separate databases, and suddenly we can handle the additional customers without problems. Each database contains all the data necessary to support a given customer, but individual customers are split across different databases.
How do you split the data across multiple databases and know, within the application, which database has which customer’s data? Typically, a sharding key is used to determine which database contains a particular set of data. Often, this sharding key is something such as a customer ID. By assigning some customer IDs to one database and other customer IDs to another database, you can put all the data for a given customer onto a single database. Thus for each customer a single database will be used for all customer requests, and new customers can be added to new databases at any reasonable scale.
Where sharding goes wrong
So, what’s wrong with this approach? The problems start when your customers begin to grow. As customers start using the application more, they start using more storage and consuming more resources. Suddenly, the capacity for one of your shards gets overloaded, and you need to offload some of your customers from one shard to another (less heavily loaded) shard. You have to take all of these customers’ data, copy it to a new shard, then point their customer IDs to the new shard.
This isn’t a trivial operation. It’s especially non-trivial if you want to accomplish it without causing any noticeable downtime for customers. How do you move tons of data for a given customer without impacting the customer’s ability to access the application while the move is occuring? The answer usually involves writing custom tooling. This tooling is typically non-trivial to write and risky to execute. Figure 3 illustrates this process when a “Too Large Customer” overloads one database, and you have to move them to another, newer, database.
The next problem that occurs is when one customer becomes so large that it requires an entire database shard all by itself. When you are in this situation, what happens when that customer grows a bit bigger?
Suddenly, there is no place for you to move this customer, and you have reached another scaling limit-a limit that your current sharding strategy simply can’t handle.
Repartitioning, rebalancing, skewed usage, cross-shard reporting, and partitioned analytics are more problems that have to be dealt with. However, the need to handle rapidly changing data set sizes and the need to move data between shards are the biggest challenges with a quality sharding mechanism.
To shard or not to shard?
If you don’t need to shard, don’t! You can utilize other strategies, including partitioning data by service and function rather than slicing it into shards, to handle data scaling.
However, sometimes sharding is unavoidable. So, if you must shard, keep the following in mind:
- Set up your shards long before you need them. Anticipate your need for sharding based on an optimistic scale, and shard long before actual usage demands it.
- Select your sharding key carefully. You want your shards to be independent, but also well balanced. Using customer ID may seem like a good idea-it allows you to create independent datasets easily-but customers vary greatly in size, and balancing shards based on customer ID can be troublesome. It is possible to shard based on another common resource, but the specific answer depends heavily on your application’s business logic and needs.
- Build tools to manage your shards before you roll out sharding to production. You will need the tools much sooner than you expect. The tools need to be able to quickly and efficiently move individual sharded elements (customers, etc.) from one shard to another transparently. The tools must be able to rebalance multiple resources quickly during a scaling incident, and you need analytics to alert when shard sizing turns sideways.
- Look seriously into dividing your data by other methods. Consider storing your data within individual services and microservices, rather than in a centralized data store. The smaller the data set, the less need for sharding and the simpler and more efficient it is to manage sharding when required.
Most modern applications experience growth-growth in their usage, growth in the size and complexity of their data, growth in the application complexity, and growth in the number of people and the size of the organization needed to manage the application. It’s easy to ignore these growth challenges until it’s too late, then use a quick and easy fix to solve the immediate need. But when it comes to data sharding, planning and thorough execution are critical to ensuring that this architectural choice is a scaling aid, rather than a scaling liability.
Copyright © 2021 IDG Communications, Inc.
Article written by Lee Atchison, originally published at InfoWorld.