Ref:
- https://www.codesmith.io/blog/amazon-s3-storage-diagramming-system-design
- https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html
Bucket: A logical container for objects. The bucket name is globally unique. To upload data to S3, we must first create a bucket.
Object: An object is an individual piece of data we store in a bucket. It contains object data (also called payload) and metadata. Object data can be any sequence of bytes we want to store. The metadata is a set of name-value pairs that describe the object.
Timeline of features
Ref: https://highscalability.com/behind-aws-s3s-massive-scale/
The design philosophy of object storage is very similar to that of the UNIX file system.
- In UNIX, when we save a file in the local file system, it does not save the filename and file data together.
- Instead, the filename is stored in a data structure called “inode” and the file data is stored in different disk locations.
- The inode contains a list of file block pointers that point to the disk locations of the file data.
- When we access a local file, we first fetch the metadata in the inode. We then read the file data by following the file block pointers to the actual disk locations.
Object storage works similarly.
- The inode becomes the metadata store that stores all the object metadata.
- The hard disk becomes the data store that stores the object data.
- In the UNIX file system, the inode uses the file block pointer to record the location of data on the hard disk.
- In object storage, the metadata store uses the ID of the object to find the corresponding object data in the data store, via a network request.
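The inode analogy above can be sketched in code. This is a toy model, not S3's actual internals; all class and function names (`MetadataStore`, `DataStore`, `put_object`, `get_object`) are illustrative:

```python
# Toy sketch of the metadata-store / data-store split described above.
# The metadata store plays the role of the inode (name -> object ID + metadata);
# the data store plays the role of the disk (object ID -> raw bytes).
import uuid


class DataStore:
    """Stands in for the storage fleet: maps object IDs to payload bytes."""
    def __init__(self):
        self._blocks = {}

    def write(self, object_id: str, data: bytes) -> None:
        self._blocks[object_id] = data

    def read(self, object_id: str) -> bytes:
        return self._blocks[object_id]


class MetadataStore:
    """Plays the role of the inode table: maps (bucket, name) to object ID and metadata."""
    def __init__(self):
        self._entries = {}

    def put(self, bucket: str, name: str, object_id: str, metadata: dict) -> None:
        self._entries[(bucket, name)] = {"id": object_id, "metadata": metadata}

    def get(self, bucket: str, name: str) -> dict:
        return self._entries[(bucket, name)]


def put_object(meta: MetadataStore, store: DataStore,
               bucket: str, name: str, payload: bytes, metadata: dict) -> str:
    object_id = str(uuid.uuid4())        # analogous to allocating an inode number
    store.write(object_id, payload)      # payload goes to the data store
    meta.put(bucket, name, object_id, metadata)
    return object_id


def get_object(meta: MetadataStore, store: DataStore,
               bucket: str, name: str) -> bytes:
    entry = meta.get(bucket, name)       # step 1: fetch the metadata (the "inode")
    return store.read(entry["id"])       # step 2: follow the ID to the object data
```

The key difference from UNIX is the last hop: the pointer in the metadata store is an object ID resolved over the network, not a block pointer resolved on a local disk.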
Architecture
High level, ref: https://newsletter.systemdesign.one/p/s3-architecture
S3 is said to be composed of more than 300 microservices.
It tries to follow the core design principle of simplicity.
Its architecture can be divided into four high-level services:
- a front-end fleet with a REST API
- a namespace service
- a storage fleet full of hard disks
- a storage management fleet that does background operations, like replication and tiering.
Upload
Download
Multi-part upload
How can we optimize performance when uploading large files to an object storage service such as S3?
- It is possible to upload a large object in a single request, but it could take a long time.
- If the network connection fails in the middle of the upload, we have to start over.
- A better solution is to slice a large object into smaller parts and upload them independently.
We can break a large object into chunks and upload many chunks in parallel. If the upload of a chunk fails, we simply restart that chunk rather than the whole object. Parallelism also improves the overall upload speed.
Ref: https://blog.bytebytego.com/p/how-to-upload-a-large-file-to-s3 and https://aws.amazon.com/blogs/aws/amazon-s3-multipart-upload/
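The slice-and-upload-in-parallel flow can be sketched as below. This is a mock, not the real S3 API: `remote` is a stand-in for the server side, and the 5-byte part size is for illustration only (S3's real minimum part size is 5 MiB):

```python
# Sketch of multipart upload: slice the payload into parts, upload the parts
# in parallel, then stitch them back together in order ("complete" step).
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 5  # bytes, tiny for illustration; S3's real minimum is 5 MiB


def split_into_parts(payload: bytes, part_size: int = PART_SIZE) -> list:
    return [payload[i:i + part_size] for i in range(0, len(payload), part_size)]


def upload_part(remote: dict, part_number: int, data: bytes) -> int:
    remote[part_number] = data  # stand-in for a PUT of one part over the network
    return part_number


def multipart_upload(payload: bytes) -> bytes:
    remote = {}
    parts = split_into_parts(payload)
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(upload_part, remote, n, part)
                   for n, part in enumerate(parts)]
        for f in futures:
            f.result()  # a failed part could be retried here without restarting the rest
    # "complete multipart upload": reassemble the parts in part-number order
    return b"".join(remote[n] for n in sorted(remote))
```

In the real S3 API these three phases correspond to the `CreateMultipartUpload`, `UploadPart`, and `CompleteMultipartUpload` operations.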
Storage Fleet
Hard Drives
Replication
Heat Management at Scale
As the system aggregates millions of workloads, the underlying traffic to the storage flattens out remarkably. The aggregate demand produces a smoother, more predictable throughput.
When you aggregate on a large enough scale, a single workload cannot influence the aggregate peak.
The problem then becomes much easier to solve: balance a smooth demand rate across many disks.
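A small numeric experiment illustrates the smoothing claim. The workload model here is invented purely for illustration (each workload is idle most of the time with occasional bursts); the point is that the peak-to-average ratio of the aggregate is far lower than that of any single workload:

```python
# Illustration of smoothing by aggregation: many independent bursty workloads
# sum to a demand curve whose peak is close to its average.
import random

random.seed(42)

TICKS = 500
NUM_WORKLOADS = 1000


def bursty_workload(ticks: int) -> list:
    # idle 80% of the time, a burst of 100 units 20% of the time
    return [random.choice([0, 0, 0, 0, 100]) for _ in range(ticks)]


def peak_to_average(series: list) -> float:
    return max(series) / (sum(series) / len(series))


workloads = [bursty_workload(TICKS) for _ in range(NUM_WORKLOADS)]
aggregate = [sum(w[t] for w in workloads) for t in range(TICKS)]

single_ratio = peak_to_average(workloads[0])  # large: bursts dominate the average
aggregate_ratio = peak_to_average(aggregate)  # close to 1: demand is smooth
```

With a smooth aggregate, capacity planning reduces to spreading a near-constant demand rate across the disk fleet, rather than provisioning every disk for the worst-case burst of any one customer.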