Ceph: A Scalable, High Performance Distributed File System
2019-JUN-27 :: Ceph Tech Talk - Intro to Ceph
https://devops-insider.mygraphql.com/zh-cn/latest/ceph/ceph-mapping/ceph-mapping.html
The primary goals of the architecture are scalability (to hundreds of petabytes and beyond), performance, and reliability.
Ceph is designed to be extremely scalable; it is built upon the Reliable Autonomic Distributed Object Store (RADOS), a self-healing, self-managing storage layer that handles the fundamental complexity of data replication, failure detection, and recovery.
Unlike traditional architectures that rely on centralized controller nodes, which often become performance bottlenecks or single points of failure, Ceph employs a calculated placement algorithm known as CRUSH (Controlled Replication Under Scalable Hashing) to distribute data across a heterogeneous cluster.
CRUSH enables Ceph clients to communicate directly with OSDs, bypassing the need for a centralized server or broker.
Ceph uniquely delivers object, block, and file storage in one unified system.
Ref: https://canonical.com/blog/ceph-storage-on-ubuntu-an-overview
Ceph is designed to be scalable and to have no single point of failure.
RADOS
- Object Storage Daemon (OSD) - OSDs manage data, interact with logical disks
- Monitor (MON) - Manages cluster state (the monitor map, the manager map, the OSD map, and the CRUSH map) https://docs.ceph.com/en/reef/architecture/#cluster-map
- Manager (MGR) - Provides additional features like external monitoring, dashboard
- Metadata Servers (MDS) - The Ceph metadata server daemon must be running in any Ceph cluster that runs the CephFS file system
What are OSDs?
- More recent distributed file systems have adopted architectures based on object-based storage, in which conventional hard disks are replaced with intelligent object storage devices (OSDs) which combine a CPU, network interface, and local cache with an underlying disk or RAID.
- OSDs replace the traditional block-level interface with one in which clients can read or write byte ranges to much larger (and often variably sized) named objects, distributing low-level block allocation decisions to the devices themselves.
- Clients typically interact with a metadata server (MDS) to perform metadata operations (open, rename), while communicating directly with OSDs to perform file I/O (reads and writes), significantly improving overall scalability.
Ref: Ceph: A Scalable, High-Performance Distributed File System
- The Ceph Storage Cluster receives data from Ceph Clients, whether it comes through a Ceph Block Device, Ceph Object Storage, the Ceph File System, or a custom implementation that you create by using librados.
- The data received by the Ceph Storage Cluster is stored as RADOS objects.
- Each object is stored on an Object Storage Device (this is also called an “OSD”). Ceph OSDs control read, write, and replication operations on storage drives.
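The read/write/replication role of OSDs can be sketched as primary-copy replication: the client writes only to the primary OSD, which forwards the write to the replicas and acks once all copies are stored. A minimal in-memory toy (not Ceph's actual code; class and function names are illustrative):

```python
class OSD:
    """Toy in-memory OSD: stores named objects."""
    def __init__(self, name: str):
        self.name = name
        self.store = {}

    def write(self, obj: str, data: bytes):
        self.store[obj] = data

def replicated_write(acting_set, obj, data):
    """Primary-copy replication: the client sends the write to the
    primary (first OSD in the acting set); the primary applies it,
    forwards it to each replica, and acks only when every copy is stored."""
    primary, *replicas = acting_set
    primary.write(obj, data)
    for r in replicas:
        r.write(obj, data)   # in real Ceph this happens over the network
    return len(acting_set)   # number of durable copies acked

acting = [OSD(f"osd.{i}") for i in range(3)]
acked = replicated_write(acting, "obj1", b"hello")
```

The point of the sketch: clients never fan out writes themselves; the primary OSD drives replication, which is why clients only need CRUSH to find the primary.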
Decoupled Data and Metadata: Ceph maximizes the separation of file metadata management from the storage of file data.
- Metadata operations (open, rename, etc.) are collectively managed by a metadata server cluster, while clients interact directly with OSDs to perform file I/O (reads and writes).
Architecture
Pools
The Ceph storage system supports the notion of ‘Pools’, which are logical partitions for storing objects.
Ceph Clients retrieve a Cluster Map from a Ceph Monitor, and write RADOS objects to pools. The way that Ceph places the data in the pools is determined by the pool’s size or number of replicas, the CRUSH rule, and the number of placement groups in the pool.
Placement Groups
- Tracking object placement on a per-object basis within a pool is computationally expensive at scale.
- To facilitate high performance at scale, Ceph subdivides a pool into placement groups, assigns each individual object to a placement group, and assigns the placement group to a primary OSD.
- If an OSD fails or the cluster re-balances, Ceph can move or replicate an entire placement group—i.e., all of the objects in the placement groups—without having to address each object individually. This allows a Ceph cluster to re-balance or recover efficiently.
Ref: https://docs.redhat.com/en/documentation/red_hat_ceph_storage/1.2.3/html/storage_strategies/about-placement-groups
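The per-object step above can be sketched as a stable hash of the object name modulo the pool's PG count. A minimal sketch (real Ceph uses its rjenkins hash plus a "stable mod" so pg_num can grow without reshuffling everything; md5 here is a stand-in):

```python
import hashlib

def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    """Map a RADOS object name to a PG id of the form <pool>.<pg-hex>."""
    # md5 stands in for Ceph's rjenkins object-name hash (an assumption
    # of this sketch); any stable hash gives the same deterministic mapping.
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return f"{pool_id}.{h % pg_num:x}"

pg = object_to_pg(1, "A.txt", 128)
```

Because the mapping is a pure function of the object name and pool parameters, every client computes the same PG without consulting any central server.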
When CRUSH assigns a placement group to an OSD, it calculates a series of OSDs—the first being the primary.
- Each pool has a number of placement groups (PGs) within it. CRUSH dynamically maps PGs to OSDs. When a Ceph Client stores objects, CRUSH maps each RADOS object to a PG.
PGs do not own OSDs. CRUSH assigns many placement groups to each OSD pseudo-randomly to ensure that data gets distributed evenly across the cluster.
This layer of indirection allows Ceph to rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices come online.
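The PG-to-OSD step can be sketched as a deterministic, hash-ranked draw in the spirit of CRUSH's straw2 buckets (this toy ignores weights, failure domains, and the cluster hierarchy; names are illustrative):

```python
import hashlib

def crush_select(pg_id: str, osds: list, replicas: int) -> list:
    """Pseudo-randomly pick `replicas` distinct OSDs for a PG.
    Toy stand-in for CRUSH straw2: each OSD draws a hash-based
    'straw' for this PG, and the longest straws win. Deterministic,
    so any client computes the same mapping from the cluster map alone."""
    def straw(osd: str) -> int:
        return int(hashlib.md5(f"{pg_id}:{osd}".encode()).hexdigest(), 16)
    ranked = sorted(osds, key=straw, reverse=True)
    return ranked[:replicas]   # the first entry acts as the primary

osds = [f"osd.{i}" for i in range(10)]
mapping = crush_select("1.2f", osds, 3)
```

A useful property of this ranking scheme: removing an OSD that was not selected for a PG leaves that PG's mapping unchanged, which is why rebalancing after a failure only moves the affected PGs.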
CRUSH
Summary
Ref: https://ceph.io/en/news/blog/2014/how-data-is-stored-in-ceph-cluster/
- In this cluster, files (A.txt and J.txt in the source diagram) are split into several objects. These objects are grouped into placement groups (PGs), which are placed into pools.
- A pool has configurable properties, such as how many replicas of each PG are stored in the cluster (3 by default). Each PG is ultimately stored on a set of Object Storage Daemons (OSDs). An OSD stores PGs (and thus the objects within them) and provides access to them over the network.
Ref: https://www.dbi-services.com/blog/introduction-to-rook-ceph-for-kubernetes/
Note: the original Ceph paper does not discuss Pools.
Ref: https://devops-insider.mygraphql.com/zh-cn/latest/ceph/ceph-mapping/ceph-mapping.html
- Objects that are stored in a Ceph cluster are put into pools.
- Pools represent logical partitions of the cluster to the outside world. For each pool a set of rules can be defined, for example, how many replications of each object must exist. The standard configuration of pools is called replicated pool.
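These per-pool rules are set with the standard Ceph CLI. A sketch (the pool name `mypool` and the PG count of 128 are illustrative; the right pg_num depends on cluster size):

```shell
# Create a replicated pool with 128 placement groups
ceph osd pool create mypool 128 128 replicated
# Keep 3 copies of every object; serve I/O while at least 2 are available
ceph osd pool set mypool size 3
ceph osd pool set mypool min_size 2
# Verify the replication setting
ceph osd pool get mypool size
```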
Rebalancing
RADOS Gateway (RGW)
RGW Components
RGW vs RADOS Object
Internally, RGW identifies buckets by bucket ID rather than by name
Bucket Indexing
- Maintains a listing of the objects in a bucket so that bucket listings return quickly, similar to a directory
Hence, during object creation we also need to update the bucket index
All of the above operations must be atomic to avoid inconsistencies, e.g. the bucket index being updated while the object was never created
Hence we first write the tail (the object data), then do a prepare (similar to a 2-phase commit), and only after that do we commit the index entry
More details about Omap (object map) are available at https://ivanzz1001.github.io/records/post/ceph/2019/01/05/ceph-src-code-part7_5
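The write-tail / prepare / commit ordering can be simulated to show why a crash at any step leaves the bucket consistent. A toy model (class and state names are hypothetical, not RGW's actual data structures):

```python
class Bucket:
    """Toy RGW bucket: object data plus a bucket index with 2-phase entries."""
    def __init__(self):
        self.objects = {}   # object name -> data (the "tail")
        self.index = {}     # object name -> "prepared" | "committed"

    def put(self, name, data, crash_after=None):
        # 1. Write the tail (the object data itself).
        self.objects[name] = data
        if crash_after == "tail":
            return          # simulated crash: data exists, index untouched
        # 2. Prepare: record an intent entry in the bucket index.
        self.index[name] = "prepared"
        if crash_after == "prepare":
            return          # simulated crash: entry can be retried/cleaned up
        # 3. Commit: the entry becomes visible in listings.
        self.index[name] = "committed"

    def listing(self):
        # Only committed entries are visible in bucket listings.
        return [n for n, s in self.index.items() if s == "committed"]

bkt = Bucket()
bkt.put("a.txt", b"data")                          # full, successful write
bkt.put("b.txt", b"data", crash_after="prepare")   # crash before commit
```

The invariant the ordering buys us: a committed index entry always points at data that was already written, so listings never show phantom objects.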
CephFS
Distributed Metadata Server
Bluestore
For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today because it allows them to benefit from the convenience and maturity of battle-tested code. Ceph’s experience, however, shows that this comes at a high price.
- First, developing a zero-overhead transaction mechanism is challenging.
- Second, metadata performance at the local level can significantly affect performance at the distributed level.
- Third, supporting emerging storage hardware is painstakingly slow.
Ceph addressed these issues with BlueStore, a new backend designed to run directly on raw storage devices.
Ceph utilizes a novel metadata cluster architecture based on Dynamic Subtree Partitioning that adaptively and intelligently distributes responsibility for managing the file system directory hierarchy among tens or even hundreds of MDSs.
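The core idea of subtree partitioning can be sketched as a longest-matching-prefix lookup: each MDS rank is authoritative for one or more directory subtrees, and a path is served by the MDS owning its deepest enclosing subtree. A toy illustration (the assignment table is hypothetical; real Ceph migrates subtrees dynamically based on load, which this sketch omits):

```python
# Hypothetical subtree-to-MDS assignment; in Ceph this is rebalanced
# adaptively as directories become hot or cold.
subtree_auth = {
    "/": 0,
    "/home": 1,
    "/home/alice": 2,   # a hot directory migrated to its own MDS rank
}

def mds_for(path: str) -> int:
    """Return the MDS rank authoritative for `path`: the owner of the
    longest subtree prefix that encloses the path (component-aligned)."""
    matches = (p for p in subtree_auth
               if path == p or path.startswith(p.rstrip("/") + "/"))
    return subtree_auth[max(matches, key=len)]
```

This is why the approach scales: metadata load for a hot directory tree can be handed to a dedicated MDS without touching the placement of any file data.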
Ref: File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution paper
Ref: https://insujang.github.io/2020-08-30/introduction-to-ceph/