Ceph: A Scalable, High Performance Distributed File System

2019-JUN-27 :: Ceph Tech Talk - Intro to Ceph

https://devops-insider.mygraphql.com/zh-cn/latest/ceph/ceph-mapping/ceph-mapping.html

The primary goals of the architecture are scalability (to hundreds of petabytes and beyond), performance, and reliability.


Ceph is designed to be extremely scalable; it is built upon the Reliable Autonomic Distributed Object Store (RADOS), a self-healing, self-managing storage layer that handles the fundamental complexity of data replication, failure detection, and recovery.

Unlike traditional architectures that rely on centralized controller nodes, which often become performance bottlenecks or single points of failure, Ceph employs a calculated placement algorithm known as CRUSH (Controlled Replication Under Scalable Hashing) to distribute data across a heterogeneous cluster.

CRUSH enables Ceph clients to communicate directly with OSDs (Object Storage Daemons), bypassing the need for a centralized server or broker.
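The real CRUSH algorithm walks a weighted hierarchy of failure domains, but the core idea above can be sketched with a much simpler technique: rendezvous (highest-random-weight) hashing, where every client derives the same ordered OSD list purely from the cluster membership, with no central lookup. The function names below are illustrative, not Ceph APIs.

```python
import hashlib

def _score(pg_id: int, osd_id: int) -> int:
    """Deterministic pseudo-random score for a (placement group, OSD) pair."""
    h = hashlib.sha256(f"{pg_id}:{osd_id}".encode()).digest()
    return int.from_bytes(h[:8], "big")

def place(pg_id: int, osds: list[int], replicas: int = 3) -> list[int]:
    """Toy CRUSH-like mapping: rank all OSDs by score and keep the top
    `replicas`. Any client with the same OSD list computes the same answer."""
    ranked = sorted(osds, key=lambda osd: _score(pg_id, osd), reverse=True)
    return ranked[:replicas]

# Every client computing this gets an identical, ordered set of OSDs.
print(place(pg_id=42, osds=[0, 1, 2, 3, 4, 5]))
```

This is only the flavor of calculated placement; CRUSH additionally honors device weights, replica placement rules, and failure-domain boundaries (e.g. "no two replicas in the same rack").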

Ceph uniquely delivers object, block, and file storage in one unified system.

Ref: https://canonical.com/blog/ceph-storage-on-ubuntu-an-overview


Ceph is designed to be scalable and to have no single point of failure.

RADOS


What are OSDs?


Ref: Ceph: A Scalable, High-Performance Distributed File System


Decoupled Data and Metadata: Ceph maximizes the separation of file metadata management from the storage of file data.


Architecture


Pools

The Ceph storage system supports the notion of ‘Pools’, which are logical partitions for storing objects.

Ceph Clients retrieve a Cluster Map from a Ceph Monitor and write RADOS objects to pools. The way Ceph places data in a pool is determined by the pool's size (its number of replicas), the CRUSH rule, and the number of placement groups in the pool.
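The first step of that placement, object name to placement group, is a simple stable hash. A minimal sketch (Ceph itself uses the rjenkins hash and a "stable mod" that tolerates non-power-of-two PG counts, not MD5):

```python
import hashlib

def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    """Sketch of Ceph's first mapping step: hash the object name into one
    of the pool's placement groups. PG ids are conventionally written as
    <pool-id>.<pg-index-in-hex>."""
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
    return f"{pool_id}.{h % pg_num:x}"

pg = object_to_pg(pool_id=3, object_name="myobject", pg_num=128)
print(pg)  # something of the form "3.<hex>"
```

The second step, PG to OSDs, is then done by CRUSH, so the per-object state Ceph must track is bounded by the (fixed, small) number of PGs rather than the number of objects.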


Placement Groups


Ref: https://docs.redhat.com/en/documentation/red_hat_ceph_storage/1.2.3/html/storage_strategies/about-placement-groups

When CRUSH assigns a placement group to OSDs, it calculates an ordered series of OSDs; the first is the primary.


PGs do not own OSDs. CRUSH assigns many placement groups to each OSD pseudo-randomly to ensure that data gets distributed evenly across the cluster.

This layer of indirection allows Ceph to rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices come online.
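The rebalancing benefit of calculated placement can be demonstrated with the same toy rendezvous-hash scheme sketched earlier (not real CRUSH): when an OSD is added, only the PGs whose winning score changes move, and they all move to the new OSD.

```python
import hashlib

def primary(pg_id: int, osds: list[int]) -> int:
    """Toy deterministic primary selection: highest hash score wins."""
    score = lambda osd: hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
    return max(osds, key=score)

# Map 256 PGs before and after adding OSD 4 to a 4-OSD cluster.
before = {pg: primary(pg, [0, 1, 2, 3]) for pg in range(256)}
after = {pg: primary(pg, [0, 1, 2, 3, 4]) for pg in range(256)}

moved = sum(before[pg] != after[pg] for pg in range(256))
print(f"{moved}/256 PGs moved")  # on average about 1/5 of the PGs
```

Because existing OSDs' scores are unchanged, every PG that moves lands on the new OSD; the rest stay put. Real CRUSH achieves the same property with weights and failure-domain rules layered on top.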

CRUSH


Summary


Ref: https://ceph.io/en/news/blog/2014/how-data-is-stored-in-ceph-cluster/


Ref: https://www.dbi-services.com/blog/introduction-to-rook-ceph-for-kubernetes/


The original paper does not discuss Pools.



Rebalancing


Rados Gateway


RGW Components


RGW vs RADOS Object


Internally, Ceph uses the bucket ID instead of the bucket name.
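One motivation for this indirection: operations like a bucket rename only need to rewrite the small name-to-ID metadata entry, while the per-bucket index stays keyed by the immutable ID. A hypothetical sketch (the dict names here are illustrative, not RGW structures):

```python
# Hypothetical sketch of RGW's name -> bucket-id indirection.
meta = {"photos": "bkt-9f3a"}              # bucket name -> bucket id
index = {"bkt-9f3a": ["a.jpg", "b.jpg"]}   # bucket id -> object listing

def rename_bucket(old: str, new: str) -> None:
    """Renaming rewrites only the metadata entry; the bucket index,
    keyed by the immutable id, is untouched."""
    meta[new] = meta.pop(old)

rename_bucket("photos", "pictures")
print(meta["pictures"], index[meta["pictures"]])
```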


Bucket Indexing

Hence, during object creation, we also need to update the bucket index.

All of the above operations need to be atomic to avoid inconsistencies, such as the bucket index being updated while the object was never created.

Hence we first write the tail, then do a prepare (similar to a two-phase commit), and only after that do we commit.
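The write sequence described above can be sketched as follows. This is a simplified model of the prepare/commit pattern, not RGW's actual data structures; the class and tag names are invented for illustration.

```python
# Toy model of the sequence: tail first, then prepare, then commit.
class BucketIndex:
    def __init__(self) -> None:
        self.pending: dict[str, str] = {}   # in-flight operations by tag
        self.entries: dict[str, str] = {}   # visible, committed objects

    def prepare(self, key: str, tag: str) -> None:
        self.pending[tag] = key             # intent logged, not yet listed

    def commit(self, tag: str) -> None:
        key = self.pending.pop(tag)
        self.entries[key] = "complete"      # object now appears in listings

store: set[str] = set()                     # stands in for RADOS data objects
idx = BucketIndex()

store.add("obj1.tail")                      # 1. write the tail chunks
idx.prepare("obj1", tag="op-123")           # 2. prepare the index entry
store.add("obj1.head")                      # 3. write the head object
idx.commit("op-123")                        # 4. commit: object becomes visible
```

If a crash happens between prepare and commit, the object never appears in listings, and the stale pending entry can be detected and cleaned up later, which is exactly the consistency property the two-phase sequence is after.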


More details about omap (object map) are available at https://ivanzz1001.github.io/records/post/ceph/2019/01/05/ceph-src-code-part7_5

CephFS


Distributed Metadata Server


Bluestore

For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today because it allows them to benefit from the convenience and maturity of battle-tested code. Ceph’s experience, however, shows that this comes at a high price.

Ceph addressed these issues with BlueStore, a new backend designed to run directly on raw storage devices.


Ceph utilizes a novel metadata cluster architecture based on Dynamic Subtree Partitioning that adaptively and intelligently distributes responsibility for managing the file system directory hierarchy among tens or even hundreds of MDSs.
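The intuition behind dynamic subtree partitioning can be sketched with a toy balancer: when one MDS is much hotter than another, it hands off its hottest directory subtree. The real MDS balancer uses smoothed popularity counters and hierarchy-aware heuristics rather than the raw counts used here.

```python
# Toy sketch of dynamic subtree partitioning between two MDS daemons.
load = {"mds0": {"/home": 900, "/var": 50}, "mds1": {"/tmp": 40}}

def total(mds: str) -> int:
    return sum(load[mds].values())

def rebalance() -> None:
    """Move the hottest subtree off the busiest MDS when it is far more
    loaded than the idlest one (threshold here is arbitrary)."""
    busy = max(load, key=total)
    idle = min(load, key=total)
    if total(busy) > 2 * total(idle):
        subtree = max(load[busy], key=load[busy].get)
        load[idle][subtree] = load[busy].pop(subtree)

rebalance()
print(load)  # the hot /home subtree migrates to the idle MDS
```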


Ref: File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution paper

Ref: https://insujang.github.io/2020-08-30/introduction-to-ceph/