Month: August 2016

Selective Paper Reading List 2016

Inspired by Paper we love, in this blog I listed selected papers that are read, and to be read systematically in 2016.  The list is continuously growing, reflecting my explored and conquered topics in 2016.  Most of the papers fall in the category of distributed computing including classic theoretical researches and well-known system design and implementation.

Distributed Computing Theory

1978 Time, Clocks, and the Ordering of Events in a Distributed System
Leslie Lamport
1985 Distributed snapshots: determining global states of distributed systems
KM Chandy, L Lamport
2007 Paxos Made Live – An Engineering Perspective
Tushar Deepak Chandra, Robert Griesemer, Joshua Redstone
2012 CAP Twelve Years Later: How the “Rules” Have Changed
Eric Brewer
2014 In search of an understandable consensus algorithm
D Ongaro, J Ousterhout

Distributed Systems

2003 The Google file system
S Ghemawat, H Gobioff
2004 MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat
2006 The Chubby lock service for loosely-coupled distributed systems
M Burrows
2007 Dynamo: amazon’s highly available key-value store
G DeCandia, D Hastorun, M Jampani…
2008 Bigtable: A Distributed Storage System for Structured Data
F Chang, J Dean, S Ghemawat et al.
2008 Bitcoin: A Peer-to-Peer Electronic Cash System
S Nakamoto
2010 The Hadoop Distributed File System
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler
2010 Hive – A Petabyte Scale Data Warehouse Using Hadoop
A Thusoo, JS Sarma, N Jain, Z Shao
2010 Spark: Cluster Computing with Working Sets
M Zaharia, M Chowdhury, MJ Franklin et al.
2010 ZooKeeper: Wait-free coordination for Internet-scale systems
P Hunt, M Konar, FP Junqueira
2010 Finding a needle in Haystack: Facebook’s photo storage
D Beaver, S Kumar, HC Li, J Sobel, P Vajge
2010 Cassandra: a decentralized structured storage system
A Lakshman, P Malik
2011 Kafka: a Distributed Messaging System for Log Processing
J Kreps, N Narkhede
2011 Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Benjamin Hindman, Andy Konwinski et al.
2013 Apache hadoop yarn: Yet another resource negotiator
VK Vavilapalli, AC Murthy, C Douglas…
2013 Scaling Memcache at Facebook
R Nishtala, H Fugal, S Grimm, M Kwiatkowski

Stream Processing and Database

1993 The Volcano Optimizer Generator: Extensibility and Efficient Search
G. Graefe, W. J. McKenna
1996 Implementing data cubes efficiently
V. Harinarayan, A. Rajaraman, J. Ullman
2013 MillWheel: Fault-Tolerant Stream Processing at Internet Scale
T Akidau, A Balikov, K Bekiroğlu et al.
2015 The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
T Akidau, R Bradshaw et al.
2015 Lightweight Asynchronous Snapshots for Distributed Dataflows
P Carbone, G Fóra, S Ewen, S Haridi et al.
2016 SamzaSQL: Scalable Fast Data Management with Streaming SQL
M Pathirage, J Hyde et al.

Functional Programming

1989 Why Functional Programming Matters
J Hughes
1992 The Essence of Functional Programming
P Wadler
1995 Monads for functional programming
P Wadler

Algorithms

1985 Random Sampling with a Reservoir
Jeffrey Scott Vitter
1985 Self-Adjusting Binary Search Trees
DD Sleator, RE Tarjan
1995 On-line construction of suffix trees
Esko Ukkonen
2011 A Comprehensive Study of Convergent and Commutative Replicated Data Types
Mark Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski
2015 Efficient Range Minimum Queries using Binary Indexed Trees
M DIMA, R CETERCHI

Union-Find

The union-find is a classic data structure used in lots of algorithms, for instance, cycle detection in undirected graph, Tarjan lowest common ancestor.  The code illustated in this blog passes AOJ Disjoint Set: Union Find Tree and available at Github AlgImpl.

The Problem

Given n elements, initially each element is in a different set, {1}, {2}, …, {n}.
Two operations are supported on the sets by union-find, which are union and find.

  • A union operation combines two sets into one.
  • A find operation identifies the set that contains a particular element.

An intermixed sequence of union and find operations is performed. We need to devise most efficient algorithm for union and find operations.

Set Representation

An int array id is a parent-link reprensentation of a forest of trees. A concrete example is shown below

idx 0 1 2 3 4 5 6 7 8 9
parent 0 1 1 8 3 0 5 7 8 8

forest of trees
Thus, each set can be represented by the ancestor of the tree. This is the meaning of what Find returns.

Optimizations

Two optimizations are applied to boost performance.

  1. Make trees Weighted in order to append lighter tree to heavier one.
  2. Path compression to flatten tree

These two tricks result in amortized time complexity of find and union to be almost O(1), or to be more exact, inverse of Ackermann’s function.

Full Java Code

public class UnionFind {
    int[] id;
    int count;
    int[] weight;  // size indexed by root id

    public UnionFind(int n) {
        id = new int[n];
        weight = new int[n];
        count = n;
        for (int idx = 0; idx < id.length; idx++) {
            id[idx] = idx;
            weight[idx] = 1;
        }
    }

    public void union(int p, int q) {
        int pRoot = find(p);
        int qRoot = find(q);
        if (pRoot == qRoot) {
            return;
        }
        // make smaller root point to larger one
        if (weight[pRoot] < weight[qRoot]) {
            id[pRoot] = qRoot;
            weight[qRoot] += weight[pRoot];
        } else {
            id[qRoot] = pRoot;
            weight[pRoot] += weight[qRoot];
        }
        count--;
    }

    // path compression
    public int find(int p) {
        if (id[p] != p) {
            id[p] = find(id[p]);
        }
        return id[p];
    }

    public boolean connected(int p, int q) {
        return find(p) == find(q);
    }

    public int count() {
        return count;
    }
}

Code in ME-Mydoc

union-find mydoc

Powered by WordPress & Theme by Anders Norén