
Locality Sensitive Hashing: MinHash vs SimHash


Locality-sensitive hashing (LSH) reduces the dimensionality of high-dimensional data. LSH differs from conventional and cryptographic hash functions in that it aims to maximize the probability of a “collision” for similar items. The two most widely used schemes for detecting near-duplicate text documents are min-hash (Broder) and sim-hash (Charikar). In this series, we study min-hash and sim-hash in detail.

  • On the resemblance and containment of documents
    Broder gives insight into how min-hash works from an engineering perspective.

  • Similarity Estimation Techniques from Rounding Algorithms
    Charikar gives the mathematical idea behind simhash and general properties of locality-sensitive hashing.

  • Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms
    Google ran an experiment comparing minhash and simhash at large scale.

  • Min-Wise Independent Permutations
    Broder, Charikar, Frieze and Mitzenmacher provide mathematical proofs of upper and lower bounds for min-wise independent permutation families.

  • Detecting Near-Duplicates for Web Crawling
    A Google paper addressing the Hamming Distance Problem once a hashing scheme is decided, i.e. given a collection of f-bit fingerprints and a query fingerprint $\mathcal F$, identify whether an existing fingerprint differs from $\mathcal F$ in at most $k$ bits.

On the resemblance and containment of documents

1997 Andrei Z. Broder

Resemblance and Containment

The resemblance $r(A, B)$ of two documents $A$ and $B$ is a number between 0 and 1, such that when the resemblance is close to 1 it is likely that the documents are roughly the same. Similarly, the containment $c(A, B)$ of $A$ in $B$ is a number between 0 and 1 that, when close to 1, indicates that $A$ is roughly contained within $B$.

The approach to estimating resemblance and containment is to keep a sketch of $k$ values for each document $A$

$ S_A = (sketch_1, sketch_2, \cdots, sketch_k)$

Then the resemblance of $A$ and $B$ becomes
r(A,B) = \frac{\vert S(A) \cap S(B) \vert}{\vert S(A) \cup S(B) \vert}

and containment becomes
c(A,B) = \frac{\vert S(A) \cap S(B) \vert}{\vert S(A) \vert}


A contiguous subsequence of $w$ tokens contained in $D$ is called a shingle.

For example,

$D=(a, rose, is, a, rose, is, a, rose)$

then $S(D,w=4)$, the 4-shingling of $D$, is the bag

$\{(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is), (a, rose, is, a), (rose, is, a, rose)\}$

There are two options for deriving $S(D,w)$:

  • Option A: labelled shingling, where each shingle carries its occurrence number
    $S(D,w=4) = \{(a, rose, is, a, 1), (rose, is, a, rose, 1), (is, a, rose, is, 1), (a, rose, is, a, 2), (rose, is, a, rose, 2)\}$
  • Option B: occurrence only, without counts
    $S(D,w=4) = \{(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is)\}$

Once one option is selected, resemblance and containment are defined as
r_w(A,B) = \frac{\vert S(A,w) \cap S(B,w) \vert}{\vert S(A,w) \cup S(B,w) \vert}

c_w(A,B) = \frac{\vert S(A,w) \cap S(B,w) \vert}{\vert S(A,w) \vert}
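The shingle-based definitions above can be sketched in a few lines of Python. This is an illustrative sketch (function names are mine) using Option B, the set of shingles without occurrence counts:

```python
def shingles(tokens, w):
    """Option B: the set of w-shingles of a token sequence (no occurrence counts)."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a, b, w):
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

def containment(a, b, w):
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa)

# the example document: D = (a, rose, is, a, rose, is, a, rose)
d = "a rose is a rose is a rose".split()
print(len(shingles(d, 4)))  # 3 distinct 4-shingles, as in the Option B example
```

Note that `resemblance(d, d, 4)` is 1.0, matching the intuition that identical documents have resemblance close to 1.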

Estimating Resemblance and Containment

Let $\Omega$ be the set of all labelled or unlabelled shingles, and assume $\Omega$ is totally ordered. Fix a parameter $s$; for a set $W \subseteq \Omega$, define
MIN_s(W) = \begin{cases} \text{the set of the smallest $s$ elements in $W$,} & \text{if } \vert W \vert \ge s; \\ W, & \text{otherwise.} \end{cases}

For a set $I \subseteq \mathcal N$
MOD_m(I)=\text{the set of elements of $I$ that are $0\mod m$.}

Theorem: Let $g : \Omega \to \mathcal N$ be an arbitrary injection, let $\pi : \Omega \to \Omega$ be a permutation of $\Omega$ chosen uniformly at random, and let $M(A)=MIN_s(\pi(S(A,w)))$ and $L(A)=MOD_m(g(\pi(S(A,w))))$. Then:
1.
\widehat r_w(A,B) = \frac{\vert MIN_s(M(A) \cup M(B)) \cap M(A) \cap M(B) \vert}{\vert MIN_s(M(A) \cup M(B)) \vert}
is an unbiased estimate of the resemblance of $A$ and $B$.
2.
\widehat r_w(A,B) = \frac{\vert L(A) \cap L(B) \vert}{\vert L(A) \cup L(B) \vert}
is an unbiased estimate of the resemblance of $A$ and $B$.
3.
\widehat c_w(A,B) = \frac{\vert L(A) \cap L(B) \vert}{\vert L(A) \vert}
is an unbiased estimate of the containment of $A$ in $B$.
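A minimal sketch of the $MIN_s$ estimator from the theorem, assuming a salted 64-bit hash as a stand-in for the random permutation $\pi$ (the function names are illustrative, not from the paper):

```python
import hashlib
import heapq

def perm(shingle, seed=0):
    """A salted 64-bit hash standing in for the random permutation pi."""
    data = f"{seed}:{shingle}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def sketch(shingle_set, s, seed=0):
    """M(A) = MIN_s(pi(S(A, w))): the s smallest permuted shingle values."""
    return set(heapq.nsmallest(s, (perm(x, seed) for x in shingle_set)))

def est_resemblance(ma, mb, s):
    """The estimator from the theorem, computed from the sketches alone."""
    union_min = set(heapq.nsmallest(s, ma | mb))  # MIN_s(M(A) ∪ M(B))
    return len(union_min & ma & mb) / len(union_min)

a = {"a rose is a", "rose is a rose", "is a rose is"}
b = {"a rose is a", "rose is a rose", "is a tulip is"}
ma, mb = sketch(a, s=2), sketch(b, s=2)
print(est_resemblance(ma, mb, 2))  # an estimate of r(A, B) = 2/4 = 0.5
```

The key property is that only the fixed-size sketches, not the full shingle sets, are needed at comparison time.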

Implementation Issues

Instead of keeping each shingle as-is, which is costly in storage, we first associate to each shingle a (shorter) id of $l$ bits, and then use a random permutation $\pi$ of the set $\{0, \ldots, 2^l - 1\}$. Fix $\pi$ and call the composed mapping $f$:

$f: \Omega \to \{0, \ldots, 2^l - 1\}$

Then the estimated resemblance is
r_{w,f}(A,B) = \frac{\vert f(S(A,w)) \cap f(S(B,w)) \vert}{\vert f(S(A,w)) \cup f(S(B,w)) \vert}

In implementation, Rabin fingerprints are used because their probability of collision is well understood and can be computed very efficiently in software.
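As a sketch of this fingerprinting step (using a truncated SHA-1 in place of a Rabin fingerprint, purely for illustration):

```python
import hashlib

def f(shingle, l=32):
    """Map a shingle to an l-bit id; a stand-in for a Rabin fingerprint."""
    digest = hashlib.sha1(shingle.encode()).digest()
    return int.from_bytes(digest[:8], "big") % (1 << l)

def est_resemblance(sa, sb, l=32):
    """r_{w,f}(A, B) computed on the fingerprint images of the shingle sets."""
    fa = {f(x, l) for x in sa}
    fb = {f(x, l) for x in sb}
    return len(fa & fb) / len(fa | fb)
```

With $l$ bits per id, the storage per shingle is constant regardless of the shingle width $w$, at the cost of a small, well-understood collision probability.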

Similarity Estimation Techniques from Rounding Algorithms

2002 Moses S. Charikar

Definition of Locality Sensitive Hashing

A locality sensitive hashing scheme is a distribution on a family $\mathcal F$ of hash functions operating on a collection of objects, such that for two objects $x$, $y$,
\mathbf {Pr}_{h \in \mathcal F}[h(x)=h(y)] = sim(x, y)
where $sim(x, y) \in [0, 1]$ is some similarity function defined on the collection of objects.

The paper proves, for certain similarity functions, properties that lead to the existence of a locality sensitive hashing scheme.
For example, for Jaccard coefficient of similarity
sim(A,B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert}
min-wise independent permutations scheme allows the construction of a distribution on hash functions $h: 2^U \to U$ such that
\mathbf {Pr}_{h \in \mathcal F}[h(x)=h(y)] = sim(x, y)

Random Hyperplane Based Hash Functions for Vectors

Given a collection of vectors in $R^d$, we consider the family of hash functions defined as follows: We choose a random vector $\vec r$ from the $d$-dimensional Gaussian distribution. Corresponding to this vector $\vec r$, we define a hash function $h_{\vec r}$ as follows:
h_{\vec r}(\vec u) = \begin{cases} 1 & \text{if $\vec r \cdot \vec u \ge 0$,} \\ 0 & \text{if $\vec r \cdot \vec u < 0$.} \end{cases}

Then for vectors $\vec u$ and $\vec v$:

\mathbf {Pr}_{h \in \mathcal F}[h(\vec u)=h(\vec v)] = 1 - \frac{\theta(\vec u, \vec v)}{\pi}
where $\theta(\vec u, \vec v)$ is the angle between the two vectors; in particular, when $\vec u$ and $\vec v$ are the characteristic vectors of sets $A$ and $B$,
\theta = \cos^{-1}\left(\frac{\vert A \cap B\vert}{\sqrt {\vert A \vert \cdot \vert B \vert }} \right)
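A minimal sketch of the random-hyperplane scheme (plain Python, standard library only; identifier names are mine):

```python
import math
import random

def random_planes(dim, n_bits, seed=42):
    """Draw n_bits random vectors from the dim-dimensional Gaussian distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def simhash_bits(u, planes):
    """h_r(u) = 1 if r . u >= 0 else 0, for each random vector r."""
    return [1 if sum(r_j * u_j for r_j, u_j in zip(r, u)) >= 0.0 else 0
            for r in planes]

def est_angle(bits_u, bits_v):
    """Pr[bits differ] = theta / pi, so theta ~= pi * (fraction of differing bits)."""
    diff = sum(a != b for a, b in zip(bits_u, bits_v))
    return math.pi * diff / len(bits_u)

planes = random_planes(dim=2, n_bits=2000)
u, v = [1.0, 0.0], [0.0, 1.0]  # orthogonal vectors, true angle theta = pi/2
est = est_angle(simhash_bits(u, planes), simhash_bits(v, planes))
print(est)  # close to pi/2 ~= 1.5708
```

Each hash bit costs one dot product, and the angle estimate concentrates around the true $\theta$ as the number of bits grows.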

The Earth Mover Distance and Rounding Scheme

Earth Mover Distance. Given a set of points $L=\{l_1, \ldots, l_n\}$ with a distance function $d(i, j)$ defined on them, a probability distribution $P(X)$ is a set of weights $p_1, \ldots, p_n$ on the points such that $p_i \ge 0$ and $\sum_i p_i = 1$. The Earth Mover Distance $\mathbf {EMD}(P,Q)$ between two distributions $P$ and $Q$ is defined to be the cost of the minimum-cost matching that transforms one distribution into the other.

Theorem 1. The Kleinberg Tardos rounding scheme yields a locality sensitive hashing scheme such that

\mathbf {EMD}(P,Q) \le \mathbf E[d(h(P), h(Q))] \le O(\log n \log \log n) \, \mathbf {EMD}(P,Q)

The Kleinberg-Tardos rounding algorithm can be viewed as a generalization of min-wise independent permutations to a continuous setting. Their rounding procedure yields a locality sensitive hash function for vectors whose coordinates are all non-negative. Given two vectors $\vec a=(a_1, \cdots, a_n)$ and $\vec b=(b_1, \cdots, b_n)$, the similarity function is
sim(\vec a, \vec b) = \frac{\sum_i \min(a_i, b_i)}{\sum_i \max(a_i, b_i)}
Note that when $\vec a$ and $\vec b$ are the characteristic vectors for sets $A$ and $B$, this expression reduces to the set similarity measure for min-wise independent permutations.
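The min/max similarity is one line of Python, and on 0/1 characteristic vectors it coincides with the Jaccard coefficient (the helper name is mine):

```python
def minmax_sim(a, b):
    """sim(a, b) = sum_i min(a_i, b_i) / sum_i max(a_i, b_i), non-negative vectors."""
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

# characteristic vectors of A = {0, 1} and B = {1, 2} over universe {0, 1, 2}
print(minmax_sim([1, 1, 0], [0, 1, 1]))  # 1/3 = |A ∩ B| / |A ∪ B|
```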

Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms

2006 Monika Henzinger

Google ran an experiment comparing the two algorithms at very large scale, on a set of 1.6B distinct web pages. The results show that Charikar’s algorithm finds more near-duplicate pairs on different sites and achieves better overall precision, namely 0.50 versus 0.38 for Broder et al.’s algorithm, but neither algorithm works well for finding near-duplicate pairs on the same site.

Further, they gave a combined algorithm: first compute all B-similar pairs, then filter out those pairs whose C-similarity falls below a certain threshold.

Min-Wise Independent Permutations

2000 Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher


We say that $\mathcal F \subseteq S_n$ is min-wise independent if for any set $X \subseteq [n]$ and any $x \in X$, when $\pi$ is chosen at random in $\mathcal F$ we have

\mathbf {Pr}(\min\{\pi(X)\} = \pi(x)) = \frac{1}{\vert X \vert}

In other words, we require that all the elements of any fixed set $X$ have an equal chance to become the minimum element of the image of $X$ under $\pi$.


We say that $\mathcal F \subseteq S_n$ is approximately min-wise independent with relative error $\epsilon$ if for any set $X \subseteq [n]$ and any $x \in X$, when $\pi$ is chosen at random in $\mathcal F$ we have
\left\vert \mathbf {Pr}(\min\{\pi(X)\} = \pi(x)) - \frac{1}{\vert X \vert} \right\vert \le \frac{\epsilon}{\vert X \vert}
In other words, we require that all the elements of any fixed set $X$ have an almost equal chance to become the minimum element of the image of $X$ under $\pi$. The expected relative error made in evaluating resemblance using approximately min-wise independent families is less than $\epsilon$.
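The definition can be checked exhaustively for small $n$: the full symmetric group $S_n$ is exactly min-wise independent (relative error 0). A small verification sketch, assuming nothing beyond the definition:

```python
from fractions import Fraction
from itertools import permutations

def min_prob(n, X, x):
    """Pr over all pi in S_n that pi(x) is the minimum of pi(X)."""
    perms = list(permutations(range(n)))
    hits = sum(1 for pi in perms if pi[x] == min(pi[i] for i in X))
    return Fraction(hits, len(perms))

# every element of X is the minimum with probability exactly 1/|X|
X = {0, 2, 3}
print([min_prob(4, X, x) for x in X])  # [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
```

The point of the paper is that much smaller families than $S_n$ can achieve this property approximately, which is what makes practical implementations feasible.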

Detecting Near-Duplicates for Web Crawling

2007 Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma

Hamming Distance Problem

Given a collection of f-bit fingerprints and a query fingerprint $\mathcal F$, identify whether an existing fingerprint differs from $\mathcal F$ in at most $k$ bits.


Build $t$ tables $T_1, T_2, \ldots, T_t$. Each table $T_i$ stores a bit-reordered copy of every fingerprint: a chosen set of $p_i$ bit-positions is moved to the front and serves as the table key, and the full permuted fingerprint is the stored value. For example:

$$b_1, b_2, \ldots, b_{t_1}, \ldots, b_{t_p}, \ldots, b_f \to$$
$$\text{key in table: } b_{t_1}, \ldots, b_{t_p} \quad \text{value in table: the full permuted fingerprint}$$

So associated with table $T_i$ are two quantities: an integer $p_i$ (the key length in bits) and a permutation $\pi_i$ over the $f$ bit-positions.

Given fingerprint $\mathcal F$ and an integer $k$, we can now probe these tables in parallel:
Step 1: Identify all permuted fingerprints in $T_i$ whose top $p_i$ bit-positions match the top $p_i$ bit-positions of $\pi_i(\mathcal F)$.
Step 2: For each of the permuted fingerprints identified in Step 1, check if it differs from $\pi_i(\mathcal F)$ in at most $k$ bit-positions.

If we seek all (permuted) fingerprints that match the top $p_i$ bits of a given (permuted) fingerprint, we expect $2^{d-p_i}$ matches, where $2^d$ is the total number of fingerprints.
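The two-step probe can be sketched as follows (a toy with 8-bit fingerprints and a single table; identifier names are mine, not the paper’s):

```python
from collections import defaultdict

F = 8  # fingerprint width in bits (toy value; the paper uses f = 64)

def hamming(a, b):
    return bin(a ^ b).count("1")

def permute(fp, order):
    """Reorder the F bits of fp so that order[0] becomes the most significant bit."""
    out = 0
    for src in order:
        out = (out << 1) | ((fp >> (F - 1 - src)) & 1)
    return out

def build_table(fps, order, p):
    """Key each permuted fingerprint by its top p bits."""
    table = defaultdict(list)
    for fp in fps:
        pf = permute(fp, order)
        table[pf >> (F - p)].append(pf)
    return table

def probe(table, query, order, p, k):
    """Step 1: match on the top p bits; Step 2: verify Hamming distance <= k."""
    pq = permute(query, order)
    return [c for c in table[pq >> (F - p)] if hamming(c, pq) <= k]

order = list(range(F))  # identity permutation for the demo
table = build_table([0b10110001, 0b01000000], order, p=4)
print(probe(table, 0b10110011, order, p=4, k=3))  # finds 0b10110001 (1 bit apart)
```

In the real design, several tables with different permutations are probed in parallel so that every fingerprint within $k$ bits shares the key bits with the query in at least one table.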


Consider $f = 64$ (64-bit fingerprints) and $k = 3$, so near-duplicates’ fingerprints differ in at most 3 bit-positions. Assume we have 8B $= 2^{34}$ existing fingerprints, i.e. $d = 34$.

We split the 64 bits into 4 blocks of 16 bits each. There are ${4 \choose 1} = 4$ ways of choosing 1 of these 4 blocks (the 3 differing bits can be scattered across at most 3 of the other blocks, so some block matches exactly). For each such choice, we divide the remaining 48 bits into four blocks of 12 bits each; there are ${4 \choose 1} = 4$ ways of choosing 1 of these 4 blocks. The permutation for a table places the bits of the chosen blocks in the leading positions, giving $4 \times 4 = 16$ tables. The value of $p_i$ is $28 = 16 + 12$ for all tables. On average, a probe retrieves $2^{34-28} = 64$ (permuted) fingerprints.
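The arithmetic of this design, checked directly from the block counts given above:

```python
from math import comb

f, k, d = 64, 3, 34          # fingerprint bits, max differing bits, log2(#fingerprints)
outer = comb(4, 1)           # choose 1 of the four 16-bit blocks
inner = comb(4, 1)           # choose 1 of the four 12-bit blocks of the remaining 48 bits
tables = outer * inner       # total number of tables
p = 16 + 12                  # key length p_i, identical for every table
per_probe = 2 ** (d - p)     # expected fingerprints retrieved per probe
print(tables, p, per_probe)  # 16 28 64
```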

Ceph Paper

Ceph Distributed Storage

Ceph is a distributed file system pioneered by Sage A. Weil that maximizes the distributed capability of file storage in three aspects:
1. Decoupled Data and Metadata
2. Dynamic Distributed Metadata Management
3. Autonomic Distributed Object Storage

Ceph’s essence is the CRUSH (Controlled Replication Under Scalable Hashing) algorithm, a probabilistic hash algorithm (think of consistent hashing) based on the RUSH algorithm, which allows all clients to compute the location of any file data segment without a centralized service (a name node, as in HDFS). The CRUSH algorithm evolved from the prior pseudo-random data distribution algorithms described below. Note that all papers listed below in this blog are from SSRC (Storage Systems Research Center), University of California, Santa Cruz.

2006 Ceph: A Scalable, High-Performance Distributed File System
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long

Sage’s overview of Ceph system, covering design principles, CRUSH, dynamic metadata subtree partitioning and relaxed POSIX file metadata semantics.

2006 CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Carlos Maltzahn

Explains how CRUSH, a pseudo-random deterministic function, maps an input value to a list of weighted devices on which to store object replicas. CRUSH extends RUSH by introducing the straw bucket strategy (well described in this Chinese blog post: CRUSH straw).

2004 Dynamic Metadata Management for Petabyte-scale File Systems
Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, Ethan L. Miller

The dynamic metadata management system, where adaptive subtree partitioning is introduced to achieve both optimal hierarchical tree partitioning for the cluster workload and load balancing of large numbers of clients accessing the same file in “flash crowd” fashion.

2007 RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters
Sage A. Weil, Andrew W. Leung, Scott A. Brandt, Carlos Maltzahn

RADOS is the service abstracted from Ceph.

2003 A Fast Algorithm for Online Placement and Reorganization of Replicated Data
R. J. Honicky, Ethan L. Miller

The initial pseudo-random data distribution algorithm (RUSHp in the RUSH family), on which CRUSH is built. It supports weighted devices and adding or removing devices dynamically while achieving near-optimal data migration and distribution. RUSHp utilizes an advanced analytic number theory result, the Prime Number Theorem for Arithmetic Progressions.

2004 Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution
R. J. Honicky, Ethan L. Miller

An excellent family of algorithms in the class of so-called Scalable Distributed Data Structures.

Consensus Paper

Consensus Theory

Recently, I studied a series of influential papers on consensus. Consensus under non-Byzantine conditions is the theory backing distributed systems such as Chubby (Paxos), etcd (Raft) and ZooKeeper (Zab). These algorithms, together with their correctness arguments, are hard to interpret but worth the effort.

1978 Time, Clocks, and the Ordering of Events in a Distributed System
Leslie Lamport

This paper by Leslie Lamport, the 2013 Turing Award winner, won the ACM SIGOPS Hall of Fame Award (2007) and the Dijkstra Prize (2000). It introduced partial and total ordering of events in a distributed environment, which greatly influenced the happens-before concept in multi-threaded programming. The clock ideas presented in the paper also inspired the multithreading race detection of Golang’s race detector; see Vector clock and How Developers Use Data Race Detection Tools.

1985 Impossibility of Distributed Consensus with One Faulty Process
Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson

This amazing paper, which won the Dijkstra Prize (2001), asserts that no completely correct asynchronous consensus algorithm exists even when only one faulty process must be tolerated. Lemma 3 in the paper is mind-boggling, better interpreted with the help of A Brief Tour of FLP Impossibility.

2001 Paxos Made Simple
Leslie Lamport

This is the paper written by Lamport himself after the famously rejected paper The Part-Time Parliament; the anecdote can be found in Leslie Lamport’s Writings.
The paper clearly discusses single-decree Paxos, that is, deciding a single value among processes, but does not explain multi-Paxos, deciding a series of values, in much depth.

1996 How to Build a Highly Available System Using Consensus
Butler W. Lampson

This is the paper in which Lampson, the 1992 Turing Award winner, reinterpreted Paxos and helped publicize it. It explains Paxos from another perspective, perhaps making it easier to understand.

2007 Paxos Made Live – An Engineering Perspective
Tushar Deepak Chandra, Robert Griesemer, Joshua Redstone

The Google Chubby team implemented Chubby using multi-Paxos, with which a replicated state machine is made possible. The paper discusses the many engineering issues encountered when putting it into practical use. I found another article by the Tencent Weixin team (in Chinese) that explains multi-Paxos well: Weixin PhxPaxos. There is also a slide deck presenting consensus history, the Paxos family and replicated state machines: Distributed Consensus: Making Impossible Possible.

2014 In search of an understandable consensus algorithm
Diego Ongaro, John Ousterhout

Raft, a replicated state machine consensus algorithm, was designed to be understood with less effort than the Paxos family. Nowadays, quite a few consensus systems, most notably etcd, are implemented using Raft.

2011 Zab: High-performance broadcast for primary-backup systems
Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini

The core protocol of ZooKeeper.

2012 ZooKeeper’s atomic broadcast protocol: Theory and practice
André Medeiros

2012 Viewstamped Replication Revisited
Barbara Liskov, James Cowling

Barbara Liskov, the 2008 Turing Award winner, revisited another replicated state machine consensus algorithm, Viewstamped Replication.
