Also refer to https://timilearning.com/posts/mit-6.824/lecture-16-memcache-at-facebook/ and https://blog.bytebytego.com/p/how-facebook-served-billions-of-requests
Pre-memcache
Adding memcache
What is memcache
memcached is a high-performance, distributed memory object caching system, generic in nature, but originally intended for use in speeding up dynamic web applications by alleviating database load.
Ref: https://memcached.org/about
Ref: https://engineering.fb.com/2013/04/15/core-infra/scaling-memcache-at-facebook/
How Facebook served billions of requests per second Using Memcached
Memcached is a well-known, simple, in-memory caching solution. It was originally developed by Brad Fitzpatrick for LiveJournal in 2003, first written in Perl and later rewritten in C by Anatoly Vorobey.
Ref: https://www.linuxjournal.com/article/7451
Facebook took up the open-source version of Memcached and enhanced it to build a distributed key-value store. This enhanced version was known as Memcache.
- The open-source version Facebook started with provides a single-machine in-memory hash table; memcached has no server-to-server coordination.
- They took this basic building block, made it more efficient, and used it to build a distributed key-value store that processes billions of requests per second and supports the world's largest social network.
The following properties greatly influence their design.
- First, users consume an order of magnitude more content than they create. This behavior results in a workload dominated by fetching data and suggests that caching can have significant advantages.
- Second, their read operations fetch data from a variety of sources such as MySQL databases, HDFS installations, and backend services. This heterogeneity requires a flexible caching strategy able to store data from disparate sources.
Major design goals:
- Any change must address a user-facing or operational issue. Optimizations that have limited scope are rarely considered.
- They treat the probability of reading transient stale data as a parameter to be tuned, much like responsiveness, and are willing to expose slightly stale data in exchange for insulating the backend storage service from excessive load.
How requests are served
- Memcache is used as a demand-filled look-aside cache: a web server first requests a key from memcache, and on a miss it fetches the data from the database and sets the key back into the cache.
- On an update, the web server writes to the database and then sends a delete to memcache. They choose to delete cached data instead of updating it because deletes are idempotent.
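A minimal sketch of this read/write path (look-aside cache with delete-on-update); plain dicts stand in for a memcached cluster and a MySQL database:

```python
cache = {}                    # stand-in for a memcached cluster
db = {"user:1": "Alice"}      # stand-in for the authoritative MySQL store

def read(key):
    value = cache.get(key)    # 1. try the cache first
    if value is None:         # 2. on a miss, go to the database...
        value = db[key]
        cache[key] = value    # 3. ...and demand-fill the cache
    return value

def write(key, value):
    db[key] = value           # 1. update the authoritative store
    cache.pop(key, None)      # 2. DELETE (not update) the cached copy;
                              #    deletes are idempotent, so retries are safe

read("user:1")                # miss -> fills the cache from the database
write("user:1", "Bob")        # invalidates the cached copy
read("user:1")                # miss again -> reads the new value
```

Deleting rather than setting the new value means a retried invalidation can never clobber a fresher write, which is exactly why idempotence matters here.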
Consistent Hashing
Items are distributed across the memcached servers through consistent hashing.
Consistent hashing is a technique for distributing a set of keys across multiple nodes in a way that minimizes the impact of node failures or additions. Servers and keys are hashed onto a ring, and each key is assigned to the node that falls closest to it in a clockwise direction.
Clients maintain a map of all available servers, which is updated through an auxiliary configuration system.
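A toy sketch of such a ring, under illustrative assumptions (md5 as the hash, made-up server names, no virtual nodes):

```python
import bisect
import hashlib

def h(s: str) -> int:
    # Stable hash onto a 32-bit ring; md5 is used purely for illustration.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    """Toy consistent-hash ring: a key goes to the first server clockwise."""
    def __init__(self, servers):
        self.ring = sorted((h(s), s) for s in servers)

    def lookup(self, key):
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, h(key)) % len(self.ring)  # wrap around
        return self.ring[i][1]

ring = HashRing(["mc1", "mc2", "mc3"])
owner_before = ring.lookup("user:42")
ring2 = HashRing(["mc1", "mc2", "mc3", "mc4"])  # add a server:
# only keys between mc4 and its predecessor move; most keys stay put
```

Production systems typically place many virtual nodes per server on the ring to even out the load, which this sketch omits.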
Client-server communication:
- Memcached servers do not communicate with each other.
- When appropriate, they embed the complexity of the system into a stateless client rather than in the memcached servers.
- This greatly simplifies memcached and allows them to focus on making it highly performant for a more limited use case.
- Keeping the clients stateless enables rapid iteration in the software and simplifies the deployment process.
- Client logic is provided as two components: a library that can be embedded into applications, or a standalone proxy named mcrouter. The proxy presents a memcached server interface and routes requests and replies to and from other servers.
Reducing latency
At Facebook’s scale, a single web request can trigger hundreds of fetch requests to retrieve data from Memcached servers. Consider a scenario where a user loads a popular page containing numerous posts and comments.
Parallel requests and batching:
- They structure their web-application code to minimize the number of network round trips necessary to respond to page requests.
- They construct a directed acyclic graph (DAG) representing the dependencies between data. By analyzing the DAG, the web server can determine the optimal order and grouping of data fetches.
- A web server uses this DAG to maximize the number of items that can be fetched concurrently. On average these batches consist of 24 keys per request.
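The batching step after the DAG analysis can be sketched as follows; the `route` function is a placeholder for the client's consistent-hashing lookup, and all names are illustrative:

```python
from collections import defaultdict

def plan_batches(keys, route):
    """Group keys that can be fetched concurrently by destination server,
    so each server receives one multi-get instead of many single gets.
    route(key) -> server name; returns {server: [keys]}."""
    batches = defaultdict(list)
    for k in keys:
        batches[route(k)].append(k)
    return dict(batches)

route = lambda k: f"mc{hash(k) % 3}"   # placeholder routing function
batches = plan_batches([f"post:{i}" for i in range(10)], route)
# Each batch is then sent as one request, in parallel across servers.
```

One multi-get per server per DAG level keeps the round-trip count proportional to the depth of the dependency graph rather than the number of keys.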
Using UDP
Facebook employed a clever strategy to optimize network communication between the web servers and the Memcache server.
- For fetch requests, Facebook configured the clients to use UDP instead of TCP.
- UDP is a connectionless protocol with lower per-request overhead than TCP (no connection setup or per-connection state). Using it lets the clients send fetch requests to the Memcache servers with less network overhead, resulting in faster request processing and reduced latency.
- However, UDP has a drawback: it doesn’t guarantee the delivery of packets. If a packet is lost during transmission, UDP doesn’t have a built-in mechanism to retransmit it.
- To handle such cases, they treated UDP packet loss as a cache miss on the client side. If a response isn’t received within a specific timeframe, the client assumes that the data is not available in the cache and proceeds to fetch it from the primary data source.
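The timeout-as-miss behavior can be sketched like this; the port, wire format, and timeout value are invented for illustration:

```python
import socket

def udp_get(key, addr=("127.0.0.1", 11211), timeout=0.05):
    """Fire a UDP fetch; treat any loss or timeout as a cache miss."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(key.encode(), addr)   # fire-and-forget request
        data, _ = sock.recvfrom(4096)     # wait briefly for a reply
        return data
    except (socket.timeout, ConnectionRefusedError, ConnectionResetError):
        return None                       # lost packet == cache miss;
                                          # caller falls back to the database
    finally:
        sock.close()

value = udp_get("user:42")
if value is None:
    pass  # fetch from the primary data source instead
```

The key property is that a dropped packet degrades to one extra database read rather than an error, which is acceptable precisely because the cache is only an optimization.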
Problems with Caching
Leases:
- Facebook introduced a new mechanism called leases to address two problems: stale sets and thundering herds.
- A stale set occurs when a web server sets a value in memcache that does not reflect the latest value that should be cached.
- This can occur when concurrent updates to memcache get reordered.
How lease token solves this
- Intuitively, a memcached instance gives a lease to a client to set data back into the cache when that client experiences a cache miss.
- The lease is a 64-bit token bound to the specific key the client originally requested.
- The client provides the lease token when setting the value in the cache.
- With the lease token, memcached can verify and determine whether the data should be stored and thus arbitrate concurrent writes.
- Verification can fail if memcached has invalidated the lease token due to receiving a delete request for that item.
- Leases prevent stale sets in a manner similar to how load-link/store-conditional operates.
- A thundering herd happens when a specific key undergoes heavy read and write activity.
- As the write activity repeatedly invalidates the recently set values, many reads default to the more costly path: all of those servers see a cache miss and reach out to the database simultaneously, increasing its load. The lease mechanism mitigates this as well.
The cache arbitrates access to the database:
- A slight modification to leases also mitigates thundering herds.
- Each memcached server regulates the rate at which it returns tokens.
- By default, these servers are configured to return a token only once every 10 seconds per key.
- Requests for a key's value within 10 seconds of a token being issued result in a special notification telling the client to wait a short amount of time.
- Typically, the client with the lease will have successfully set the data within a few milliseconds.
- Thus, when waiting clients retry the request, the data is often present in cache.
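A toy model of both lease behaviors, assuming a single memcached server; the token format, 10-second rate limit, and method names are illustrative, not the real protocol:

```python
import itertools

class LeaseCache:
    RATE = 10.0  # seconds between lease tokens per key (herd control)

    def __init__(self):
        self.data, self.leases, self.last_issue = {}, {}, {}
        self.counter = itertools.count(1)

    def get(self, key, now):
        if key in self.data:
            return ("hit", self.data[key])
        last = self.last_issue.get(key)
        if last is not None and now - last < self.RATE:
            return ("wait", None)        # another client holds a fresh lease:
                                         # retry shortly instead of hitting the DB
        token = next(self.counter)       # miss: hand out a lease token
        self.leases[key] = token
        self.last_issue[key] = now
        return ("miss", token)

    def set(self, key, value, token):
        if self.leases.get(key) != token:
            return False                 # stale set rejected (lease invalidated)
        self.data[key] = value
        del self.leases[key]
        return True

    def delete(self, key):
        self.data.pop(key, None)
        self.leases.pop(key, None)       # invalidates any outstanding lease

lc = LeaseCache()
status, token = lc.get("views:home", now=0.0)   # miss -> lease issued
lc.set("views:home", "rendered-page", token)    # succeeds: lease still valid
```

If a delete arrives between the `get` and the `set`, the lease is invalidated and the `set` returns `False`, so the reordered (stale) value never lands in the cache.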
Many memcache servers in one cluster
As more web servers are added, more memcache servers are needed to handle the increased load. This leads to all-to-all communication: every web server talks to every memcache server.
One resulting problem: when a web server issues a wide parallel fetch for many values, the responses arrive nearly simultaneously, and packet drops occur on the client side because of network congestion (incast).
Memcache clients implement flow-control mechanisms to limit incast congestion.
- When a client requests a large number of keys, the responses can overwhelm components such as rack and cluster switches if those responses arrive all at once.
- Clients therefore use a sliding window mechanism to control the number of outstanding requests.
- When the client receives a response, the next request can be sent.
- Similar to TCP’s congestion control, the size of this sliding window grows slowly upon a successful request and shrinks when a request goes unanswered.
- The window applies to all memcache requests independently of destination, whereas a TCP window applies only to a single stream.
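The AIMD-style behavior described above can be sketched as follows; the constants are illustrative, not Facebook's actual tuning:

```python
class SlidingWindow:
    """Grow the window on success, shrink it when a request goes unanswered."""
    def __init__(self, size=2, max_size=64):
        self.size, self.max_size = size, max_size
        self.outstanding = 0

    def can_send(self):
        return self.outstanding < self.size

    def sent(self):
        self.outstanding += 1

    def on_response(self):
        self.outstanding -= 1
        self.size = min(self.size + 1, self.max_size)  # additive increase

    def on_timeout(self):
        self.outstanding -= 1
        self.size = max(self.size // 2, 1)             # multiplicative decrease
```

A window that is too small adds round trips (requests queue behind `can_send`); one that is too large recreates the incast problem, so the window hunts for the middle, just as TCP congestion control does.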
Multiple clusters
- It is tempting to buy more web and memcached servers to scale a cluster as demand increases.
- However, naïvely scaling the system does not eliminate all problems.
- Highly requested items will only become more popular as more web servers are added to cope with increased user traffic.
- Incast congestion also worsens as the number of memcached servers increases.
- They therefore split the web and memcached servers into multiple frontend clusters.
- These clusters, along with a storage cluster that contains the databases, define a region.
- This region architecture also allows for smaller failure domains and a tractable network configuration.
- They trade replication of data for more independent failure domains, a tractable network configuration, and a reduction of incast congestion.
The all-to-all communication pattern limits horizontal scalability.
With multiple frontend clusters, the caches now need to be kept consistent:
- SQL statements that modify authoritative state are amended to include memcache keys that need to be invalidated once the transaction commits.
- They deploy invalidation daemons (named mcsqueal) on every database.
- Each daemon inspects the SQL statements that its database commits, extracts any deletes, and broadcasts these deletes to the memcache deployment in every frontend cluster in that region.
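The extraction-and-fanout step might look roughly like this; the SQL comment syntax for carrying memcache keys is invented here, since the notes only say statements are "amended" with the keys:

```python
import re

# Hypothetical marker embedded in committed SQL, e.g.
#   UPDATE users SET ... /* mc-invalidate: user:1 */
COMMENT = re.compile(r"/\*\s*mc-invalidate:\s*([\w:,]+)\s*\*/")

def extract_keys(sql: str):
    """Pull the memcache keys to invalidate out of a committed statement."""
    m = COMMENT.search(sql)
    return m.group(1).split(",") if m else []

def broadcast_deletes(sql, clusters):
    """Fan the deletes out to every frontend cluster in the region."""
    for key in extract_keys(sql):
        for cluster in clusters:
            cluster.delete(key)

stmt = "UPDATE users SET name='Bob' WHERE id=1 /* mc-invalidate: user:1 */"
keys = extract_keys(stmt)  # -> ["user:1"]
```

Driving invalidation off the commit log rather than the web tier means a delete is only sent once the change is durable, so the cache can never be invalidated for a transaction that later rolls back.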
Inter-cluster bandwidth is less than Intra-cluster bandwidth.
Geographically distributed clusters
Facebook accepts that writes from non-master regions must cross region boundaries to reach the master database, because the system is read-heavy: reads outnumber writes by roughly two orders of magnitude.
Memory allocation
One of the improvements Facebook made to memcached was moving to a smaller slab growth factor, so less memory is wasted when storing values in fixed-size chunks. Instead of doubling slab sizes (2^n), later versions of memcached use a much smaller growth factor of 1.25, so you will see slabs with sizes 1KB, 1.25KB, 1.56KB, and so on. This means that instead of roughly 25% waste on average, you should see closer to 10%; effectively you regain about 15% of your memcached memory just by installing a newer version.
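The 25%-versus-10% figures follow from a quick back-of-the-envelope calculation: with growth factor f, an item lands in the smallest slab class at least its size, and for item sizes spread uniformly within a class, the wasted fraction of each chunk averages (1 - 1/f) / 2:

```python
def avg_waste_fraction(f: float) -> float:
    """Average fraction of a slab chunk wasted with growth factor f.

    An item of size s in (c/f, c] is stored in a chunk of size c, wasting
    c - s. For s uniform within the class, waste/c is uniform on
    [0, 1 - 1/f), so its mean is (1 - 1/f) / 2.
    """
    return (1 - 1 / f) / 2

print(f"factor 2.00: ~{avg_waste_fraction(2.0):.0%} wasted")   # ~25%
print(f"factor 1.25: ~{avg_waste_fraction(1.25):.0%} wasted")  # ~10%
```

The difference between the two, about 15 percentage points of total cache memory, is exactly the "regained" memory the paragraph above describes.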