Consistent hashing to reduce impact of database shard rebalancing

Introduction

This post will describe an approach to balancing the distribution and redistribution of data in horizontally scaled databases.

It is something that I came across as part of preparing for system design interviews, which often involve discussions weighing up the pros and cons and risks of applying particular approaches to storing data.

Storing structured data at scale

What is structured data?

Structured data is at the core of most applications on the Internet.

Without going into too much detail, let's just say that it is data stored in a known, predictable shape, in a way that lets us reliably retrieve it for later use.

How does retrieval work? 

Retrieval of structured data basically involves taking an identifier, such as a primary key, and using an index to quickly look up the one location that is expected to hold the authoritative current state of the corresponding record.

In the simplest case of a single physical server acting as the database, it has full responsibility for all records, but that is a severe limitation on scalability.

Simple hash for sharding to scale horizontally

If we scale up to store data across multiple physical servers we can avoid redundant duplicate processing of lookups by distributing the storage based on the primary key. When we only have two servers it may be fine to apply a very simple hash and modulo arithmetic model to consistently balance the distribution of data.

Store or retrieve data with key 235723

Hashing the key might give a value such as 5821725

Modulo 2 of the hashed value gets us to 1, so that record is stored on node 1, and is retrievable from there.
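
As a minimal sketch of that routing decision (the node names and the hash function here are illustrative assumptions, not taken from any particular database):

```java
import java.util.List;

// Minimal sketch of hash-and-modulo sharding. Node names and the hash
// function are illustrative assumptions.
public class ModuloSharding {

    private final List<String> nodes;

    public ModuloSharding(List<String> nodes) {
        this.nodes = nodes;
    }

    // Any stable hash of the primary key will do; a real system would use
    // something with a better spread than String.hashCode().
    private long hashOf(String key) {
        return Math.abs((long) key.hashCode());
    }

    public String nodeFor(String key) {
        long hash = hashOf(key);                  // e.g. key 235723 hashing to 5821725
        int index = (int) (hash % nodes.size());  // 5821725 % 2 == 1, so node 1
        return nodes.get(index);
    }

    public static void main(String[] args) {
        ModuloSharding sharding = new ModuloSharding(List.of("node-0", "node-1"));
        System.out.println(sharding.nodeFor("235723"));
    }
}
```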

That works fine until we need to scale up further. Once we add a third node, taking the modulo of the hashed key drastically changes where most lookups are mapped to.
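
To get a feel for how drastic that remapping is, here is a rough illustration (the range of hash values is assumed): count how many hash values land on a different node when the modulus changes from 2 to 3. Roughly two thirds of them move.

```java
// Rough illustration (assumed range of hash values) of how many keys are
// remapped when the cluster grows from two nodes to three.
public class RemapDemo {
    public static void main(String[] args) {
        int total = 1_000_000;
        int moved = 0;
        for (long hash = 0; hash < total; hash++) {
            if (hash % 2 != hash % 3) {
                moved++;
            }
        }
        // Prints roughly 666,666 of 1,000,000 - about two thirds of keys move.
        System.out.printf("%d of %d keys would move%n", moved, total);
    }
}
```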

An extra layer of indirection

Knowing about a limitation in advance gives us an opportunity to prepare for the future and mitigate it.

Rather than the coarse-grained approach of hashing directly down to the number of nodes that are available, we can choose a wide range of potential hash values and define our own mapping from sub-ranges of that space onto nodes.

As nodes are added we can have more control over the redistribution of subsets of the data by adjusting the mapping to include the new nodes. So instead of having to take the entire database offline to step through every item for potential redistribution, we can restrict the blast radius down to a small subset at a time as the existing data gets migrated.
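
A common way to implement that extra layer of indirection is a hash ring: each node is placed at many positions (virtual nodes) around a large hash space, and a key belongs to the first node position found "clockwise" from the key's own hash. The sketch below is one assumed implementation of that idea rather than the approach of any specific database; the MD5-based hash and the virtual node count are arbitrary choices.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hash ring sketch. Keys and node names are plain Strings;
// the MD5 hash and the number of virtual nodes are illustrative choices.
public class ConsistentHashRing {

    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addNode(String node) {
        // Each physical node owns many small sub-ranges, which keeps the
        // distribution balanced and limits how much data moves on changes.
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // Walk "clockwise" to the first node position at or after the key's hash,
    // wrapping around to the start of the ring if necessary.
    public String nodeFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```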

Likewise, if we have a situation where we want to scale down for some reason, we can apply a mechanism to update the mapping to different nodes in a gradual and balanced way.
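
Using the assumed ring sketch above, scaling up or down only reassigns the sub-ranges adjacent to the node that was added or removed; every other key keeps resolving to the same node:

```java
// Illustrative usage of the ConsistentHashRing sketch above.
public class RingUsage {
    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addNode("node-0");
        ring.addNode("node-1");

        System.out.println(ring.nodeFor("235723"));

        ring.addNode("node-2");     // scale up: only the slices taken over by node-2 migrate
        System.out.println(ring.nodeFor("235723"));

        ring.removeNode("node-1");  // scale down: only node-1's slices are reassigned
        System.out.println(ring.nodeFor("235723"));
    }
}
```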
