Month: January 2018

Ceph Paper

Ceph Distributed Storage

Ceph is a distributed file system pioneered by Sage A. Weil which maximizes distributed capability of file storage in three aspects
1. Decoupled Data and Metadata
2. Dynamic Distributed Metadata Management
3. Autonomic Distributed Object Storage

Ceph’s essence is CRUSH (Controlled Replication Under Scalable Hashing) algorithm, a probabilistic hash algorithm (think of consistent hashing) based on RUSH algorithm, allowing all clients to compute location of any file data segment without centralized service (name node, typically, HDFS). CRUSH algorithm evolves from prior pseudo-random data distribution algorithms described below. Note that all papers list below in this blog are from SSRC (Storage Systems Research Center), University of California, Santa Cruz.

2006 Ceph: A Scalable, High-Performance Distributed File System
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long

Sage’s overview of Ceph system, covering design principles, CRUSH, dynamic metadata subtree partitioning and relaxed POSIX file metadata semantics.

2006 CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Carlos Maltzahn

Explains how CRUSH, the pseudo-random deterministic function, maps an input value to a list of devices with weights on which to store object replicas. CRUSH extends RUSH by introducing straw buckets strategy (well described in this Chinese blog CRUSH straw ).

2004 Dynamic Metadata Management for Petabyte-scale File Systems
Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, Ethan L. Miller

The dynamic metadata management system, where adaptive subtree partitioning is introduced to achieve both optimal hierarchical tree partitioning for cluster workload and also to load balance large number of clients accessing same file in “flash crowds” way.

2007 RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters
Sage A. Weil, Andrew W. Leung, Scott A. Brandt, Carlos Maltzahn

RADOS is the service abstracted from Ceph.

2003 A Fast Algorithm for Online Placement and Reorganization of Replicated Data
R. J. Honicky, Ethan L. Miller

The initial pseudo-random data distribution algorithm (also Rushp in Rush family), based on which CRUSH is built. It supports weighted devices, adding or removing devices dynamically while achieving optimal data migration and data distribution. RUSHp utilizes an advanced analytic number theory result called the Prime Number Theorem for Arithmetic Progressions.

2004 Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution
R. J. Honicky, Ethan L. Miller

Excellent algorithm family of so called Scalable Distributed Data Structures.

Consensus Paper

Consensus Theory

Recently, I studied a series of influential papers regarding consensus. Consensus under non-Byzantine circumstance is the theory backing up distributed systems such as Chubby (Paxos), etcd (Raft) and Zookeeper (Zab). These algorithms, together with their correctness reasoning are hard to interpret but worth the effort.

1978 Time, Clocks, and the Ordering of Events in a Distributed System
Leslie Lamport

This paper, brought by Leslie Lamport, the Turing award winner in 2013, won ACM SIGOPS Hall of Fame Award (2007) and Dijkstra award (2000). It introduced partial order and global order in distributed environment, which greatly influenced happens-before concept in multi-threaded program. The vector clock presented in the paper also inspired multithreading race detection of Golang race detector, whose paper is at Vector clock, How Developers Use Data Race Detection Tools.

1985 Impossibility of Distributed Consensus with One Faulty Process
Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson

The amazing paper, which won Dijkstra award (2001), asserts no completely correct asynchronous consensus algorithm exists even when only one faulty process is tolerated. Lemma 3 in the paper is mind boggling, better interpreted with help of A Brief Tour of FLP Impossibility.

2001 Paxos Made Simple
Leslie Lamport

This is the paper written by Lamport himself after the famously rejected paper The Part-Time Parliament. The anecdote can be found on Leslie Lamport Writings.
The paper clearly discusses single paxos, that is, deciding a single value from processes but does not explain well on multi-paxos, that is, deciding a series of values.

1996 How to Build a Highly Available System Using Consensus
Butler W. Lampson

This is the paper Lampson, 1992 Turning Award winner, reinterpreted and made Paxos publicity. It explains Paxos from another perspective, perhaps making people earier to undertand.

2007 Paxos Made Live – An Engineering Perspective
Tushar Deepak Chandra, Robert Griesemer, Joshua Redstone

Google Chubby team implemented Chubby using multi-paxos, with which, replicated state machine is made possible. The paper talks about many engineering issues when putting it into practical use. I found another article that well explains multi-paxos in an easier approach by Tencent Weixin team (in Chinese) Weixin PhxPaxos. Yet there is a slide presenting consensus history, Paxos family and replicated state machine,
Distributed Consensus: Making Impossible Possible.

2014 In search of an understandable consensus algorithm
Diego Ongaro, John Ousterhout

Raft, the replicate state machine consensus algorithm, was designed to be undertood with less effort compared to Paxos family. Nowadays, quite a lot consensus systems, most notably etcd, are implemented using raft.

2011 Zab: High-performance broadcast for primary-backup systems
Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini

Core protocol of Zookeeper

2012 ZooKeeper’s atomic broadcast protocol: Theory and practice
Andr ́e Medeiros

2012 Viewstamped Replication Revisited
Barbara Liskov, James Cowling

Barbara Liskov, Turing award winner in 2008, revisted another replicate state machine consensus algorithm.

Powered by WordPress & Theme by Anders Norén