Motivation for consistent hashing:

REF: https://www.slideshare.net/oemebamo/introduction-to-memcached and https://serhatgiydiren.github.io/system-design-interview-distributed-cache

Caching

Simple key/value store with the following operations: save, get, and delete.
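As a concrete reference point, here is a minimal in-memory sketch of that interface in Python (class and method names are illustrative, not from the references above):

```python
class Cache:
    """Minimal in-memory key/value store with save, get, and delete."""

    def __init__(self):
        self._store = {}

    def save(self, key, value):
        self._store[key] = value

    def get(self, key):
        # Returns None on a cache miss.
        return self._store.get(key)

    def delete(self, key):
        # Deleting a missing key is a no-op.
        self._store.pop(key, None)
```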

Functional Requirements

Non-Functional Requirements

Terminology

Cache access patterns:

Working with HashTable:

Collision Resolution Techniques

The problem of naive hashing function

A naive hashing function is key % n, where n is the number of servers.

It has two major drawbacks:

1. It is not horizontally scalable: whenever n changes (a server is added or removed), almost every key maps to a different server, so nearly the whole cache has to be rebuilt.
2. It may not be load balanced: if the keys are not spread uniformly (e.g., mostly sequential user IDs), a plain modulo can send far more keys to some servers than to others.

Problem 2 can be resolved by hashing the key first, hash(key) % n, so that the hashed keys are likely to be distributed more evenly. But this cannot solve problem 1. We need a solution that distributes the keys and does not depend on n.
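As a quick illustration of the hash-first variant, this sketch (the server_for helper is ours, and md5 is just one possible stable hash) spreads 100,000 keys over 4 servers and counts the keys per server:

```python
import hashlib
from collections import Counter

def server_for(key: str, n: int) -> int:
    # hash(key) % n with a stable, well-mixed hash
    # (md5 here for reproducibility, not for security).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

counts = Counter(server_for(f"user:{i}", 4) for i in range(100_000))
print(counts)  # roughly 25,000 keys per server: the load is spread evenly
```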

Scaling Out: Distributed Hashing

REF: https://www.toptal.com/big-data/consistent-hashing and https://medium.com/omarelgabrys-blog/consistent-hashing-beyond-the-basics-525304a12ba and http://www.ines-panker.com/2019/07/29/consistent-hashing.html

In some situations, it may be necessary or desirable to split a hash table into several parts, hosted by different servers. One of the main motivations for this is to bypass the memory limitations of using a single computer, allowing for the construction of arbitrarily large hash tables (given enough servers).

Note that the hash function used for key distribution must be the same one across all clients, but it need not be the same one used internally by the caching servers.

The Rehashing Problem

This distribution scheme is simple, intuitive, and works fine. That is, until the number of servers changes.

What happens if one of the servers crashes or becomes unavailable? Keys need to be redistributed to account for the missing server, of course. The same applies if one or more new servers are added to the pool; keys need to be redistributed to include the new servers. This is true for any distribution scheme, but the problem with our simple modulo distribution is that when the number of servers changes, most hashes modulo n will change, so most keys will need to be moved to a different server.

So, even if a single server is removed or added, most keys will likely need to be rehashed into a different server.
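A small experiment makes the scale of the problem visible. This sketch (same hypothetical server_for helper as above, repeated so it runs on its own) counts how many keys map to a different server when the pool grows from 4 to 5:

```python
import hashlib

def server_for(key: str, n: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(server_for(k, 4) != server_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # ~80% of all keys change servers
```

Every moved key becomes a cache miss after the change, so adding a single server effectively empties most of the cache.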

Solution: Consistent Hashing

Consistent Hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table by assigning them a position on an abstract circle, or hash ring. This allows servers and objects to scale without affecting the overall system.

Since we have the keys for both the objects and the servers on the same circle, we may define a simple rule to associate the former with the latter: each object key belongs to the server whose key is closest, moving counterclockwise (or clockwise, depending on the convention used).
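A minimal sketch of such a ring in Python, using the clockwise convention (all names here are our own, not a particular client library's API): servers are kept as a sorted list of (position, name) pairs, and a binary search finds the first server at or past a key's position, wrapping around the circle:

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    # Position on the ring: an integer in [0, 2**32).
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, servers=()):
        self._ring = []  # sorted list of (position, server) pairs
        for server in servers:
            self.add_server(server)

    def add_server(self, server: str) -> None:
        bisect.insort(self._ring, (ring_hash(server), server))

    def remove_server(self, server: str) -> None:
        self._ring.remove((ring_hash(server), server))

    def get_server(self, key: str) -> str:
        # First server clockwise from the key's position; wrap past the end.
        index = bisect.bisect_left(self._ring, (ring_hash(key), "")) % len(self._ring)
        return self._ring[index][1]

ring = HashRing(["server-A", "server-B", "server-C"])
print(ring.get_server("user:42"))  # always the same server for the same key
```

The payoff: removing a server moves only the keys that lived on it (they slide to the next server clockwise); every other key stays where it was.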

There are clients for several systems, such as Memcached and Redis, that include support for consistent hashing out of the box.

Issues with the Basic Approach of Consistent Hashing

With only a few servers on the ring, two problems appear: the partitions (arcs) owned by each server are unlikely to be of equal size, especially after servers join or leave, and the keys themselves may land unevenly on the ring, so some servers carry far more load than others. A technique called virtual nodes (or replicas) is used to solve these problems.

A virtual node refers to a real node: each server is represented by multiple virtual nodes on the ring. For example, if we assume each server has 3 virtual nodes (in real-world systems, the number of virtual nodes is much larger), server A is mapped to A0, A1, and A2. Instead of a single point A, the ring holds A0, A1, and A2 to represent server A.
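Continuing the sketch above (again with hypothetical names), virtual nodes only change how a server is inserted: A is hashed onto the ring as A0, A1, and A2, and every position maps back to the physical server:

```python
import bisect
import hashlib

VNODES = 3  # 3 virtual nodes per server, as in the example; real systems use far more

def ring_hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)

class VirtualNodeRing:
    def __init__(self, servers=()):
        self._ring = []  # sorted (position, physical server) pairs
        for server in servers:
            self.add_server(server)

    def add_server(self, server: str) -> None:
        # Server "A" appears on the ring as "A0", "A1", "A2", all pointing back to A.
        for i in range(VNODES):
            bisect.insort(self._ring, (ring_hash(f"{server}{i}"), server))

    def remove_server(self, server: str) -> None:
        self._ring = [(p, s) for (p, s) in self._ring if s != server]

    def get_server(self, key: str) -> str:
        index = bisect.bisect_left(self._ring, (ring_hash(key), "")) % len(self._ring)
        return self._ring[index][1]
```

With more virtual nodes per server, the arc sizes average out, which shrinks both the partition-size and key-distribution problems described above.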

REF: https://liuzhenglaichn.gitbook.io/system-design/advanced/consistent-hashing