Ceph Distributed Storage

Ceph is a distributed file system pioneered by Sage A. Weil which maximizes distributed capability of file storage in three aspects
1. Decoupled Data and Metadata
2. Dynamic Distributed Metadata Management
3. Autonomic Distributed Object Storage

Ceph’s essence is CRUSH (Controlled Replication Under Scalable Hashing) algorithm, a probabilistic hash algorithm (think of consistent hashing) based on RUSH algorithm, allowing all clients to compute location of any file data segment without centralized service (name node, typically, HDFS). CRUSH algorithm evolves from prior pseudo-random data distribution algorithms described below. Note that all papers list below in this blog are from SSRC (Storage Systems Research Center), University of California, Santa Cruz.

2006 Ceph: A Scalable, High-Performance Distributed File System
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long

Sage’s overview of Ceph system, covering design principles, CRUSH, dynamic metadata subtree partitioning and relaxed POSIX file metadata semantics.

2006 CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Carlos Maltzahn

Explains how CRUSH, the pseudo-random deterministic function, maps an input value to a list of devices with weights on which to store object replicas. CRUSH extends RUSH by introducing straw buckets strategy (well described in this Chinese blog CRUSH straw ).

2004 Dynamic Metadata Management for Petabyte-scale File Systems
Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, Ethan L. Miller

The dynamic metadata management system, where adaptive subtree partitioning is introduced to achieve both optimal hierarchical tree partitioning for cluster workload and also to load balance large number of clients accessing same file in “flash crowds” way.

2007 RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters
Sage A. Weil, Andrew W. Leung, Scott A. Brandt, Carlos Maltzahn

RADOS is the service abstracted from Ceph.

2003 A Fast Algorithm for Online Placement and Reorganization of Replicated Data
R. J. Honicky, Ethan L. Miller

The initial pseudo-random data distribution algorithm (also Rushp in Rush family), based on which CRUSH is built. It supports weighted devices, adding or removing devices dynamically while achieving optimal data migration and data distribution. RUSHp utilizes an advanced analytic number theory result called the Prime Number Theorem for Arithmetic Progressions.

2004 Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution
R. J. Honicky, Ethan L. Miller

Excellent algorithm family of so called Scalable Distributed Data Structures.