Locality Sensitive Hashing: Minhash vs Simhash

Locality-sensitive hashing (LSH) reduces the dimensionality of high-dimensional data. LSH differs from conventional and cryptographic hash functions because it aims to maximize the probability of a “collision” for similar items. The two most widely used schemes for detecting near-duplicate text documents are min-hash (Broder) and sim-hash (Charikar). In this series, we study min-hash and sim-hash in detail.

• On the resemblance and containment of documents
Broder gives insight into how min-hash works from an engineering perspective.

• Similarity Estimation Techniques from Rounding Algorithms
Charikar gives the mathematical idea behind sim-hash and general properties of locality-sensitive hashing.

• Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms
A Google experiment comparing min-hash and sim-hash at large scale.

• Min-Wise Independent Permutations
Broder, Charikar, Frieze, and Mitzenmacher prove upper and lower bounds for min-wise independent permutation families.

• Detecting Near-Duplicates for Web Crawling
A Google paper addressing the Hamming distance problem once a hashing scheme is chosen, i.e. given a collection of f-bit fingerprints and a query fingerprint $\mathcal F$, identify whether an existing fingerprint differs from $\mathcal F$ in at most $k$ bits.

On the resemblance and containment of documents

1997 Andrei Z. Broder

Resemblance and Containment

The resemblance $r(A, B)$ of two documents $A$ and $B$ is a number between 0 and 1 such that when the resemblance is close to 1, the documents are likely roughly the same. Similarly, the containment $c(A, B)$ of $A$ in $B$ is a number between 0 and 1 that, when close to 1, indicates that $A$ is roughly contained within $B$.

The approach to estimating resemblance and containment is to keep a sketch of $k$ elements for each document $A$

$S_A = (sketch_1, sketch_2, \cdots, sketch_k)$

Then resemblance of $A$ and $B$ becomes
$$r(A,B) = \frac{\vert S(A) \cap S(B) \vert}{\vert S(A) \cup S(B) \vert}$$

and containment becomes
$$c(A,B) = \frac{\vert S(A) \cap S(B) \vert}{\vert S(A) \vert}$$

Shingling

A contiguous subsequence contained in $D$ is called a shingle.

For example,

$D=(a, rose, is, a, rose, is, a, rose)$

then $S(D,w=4)$, the 4-shingling of $D$, is the bag

$\{(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is), (a, rose, is, a), (rose, is, a, rose)\}$

There are 2 options to derive $S(D,w)$

• Option A: labelled shingling, counting occurrences
$S(D,w=4) = \{(a, rose, is, a, 1), (rose, is, a, rose, 1), (is, a, rose, is, 1), (a, rose, is, a, 2), (rose, is, a, rose, 2)\}$
• Option B: occurrence only, no counts
$S(D,w=4) = \{(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is)\}$
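As an illustration, here is a minimal sketch of $w$-shingling in Python, assuming whitespace tokenization (the function names are mine, not the paper's):

```python
# A minimal sketch of w-shingling; Option A appends occurrence counts,
# Option B keeps only distinct shingles.
from collections import Counter

def shingles(tokens, w=4, labelled=False):
    """Return the set of w-shingles of a token sequence."""
    raw = [tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)]
    if not labelled:
        return set(raw)  # Option B: distinct shingles only
    seen = Counter()
    labelled_set = set()
    for sh in raw:
        seen[sh] += 1
        labelled_set.add(sh + (seen[sh],))  # Option A: (shingle, count)
    return labelled_set

D = "a rose is a rose is a rose".split()
print(len(shingles(D, 4)))                 # 3 distinct shingles (Option B)
print(len(shingles(D, 4, labelled=True)))  # 5 labelled shingles (Option A)
```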

Once one option is selected, resemblance and containment are defined as
$$r_w(A,B) = \frac{\vert S(A,w) \cap S(B,w) \vert}{\vert S(A,w) \cup S(B,w) \vert}$$

$$c_w(A,B) = \frac{\vert S(A,w) \cap S(B,w) \vert}{\vert S(A,w) \vert}$$
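These definitions translate directly into code. A small sketch computing exact resemblance and containment over shingle sets (min-hash exists precisely to avoid materializing the full sets, so this is for illustration only):

```python
# Exact resemblance (Jaccard) and containment over shingle sets.
def resemblance(SA, SB):
    return len(SA & SB) / len(SA | SB)

def containment(SA, SB):
    return len(SA & SB) / len(SA)

SA = {"ab", "bc", "cd"}
SB = {"bc", "cd", "de"}
print(resemblance(SA, SB))  # 2/4 = 0.5
print(containment(SA, SB))  # 2/3
```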

Estimating Resemblance and Containment

Let $\Omega$ be the set of all labelled or unlabelled shingles and assume $\Omega$ is totally ordered. Fix a parameter $s$; for a set $W \subseteq \Omega$, define
$$MIN_s(W) =\begin{cases} \text{the set of the smallest } s \text{ elements of } W, & \text{if } \vert W \vert \ge s; \\ W, & \text{otherwise.}\end{cases}$$

For a set $I \subseteq \mathcal N$
$$MOD_m(I)=\text{the set of elements of } I \text{ that are } 0 \bmod m.$$

Theorem: Let $g : \Omega \to \mathcal N$ be an arbitrary injection, let $\pi : \Omega \to \Omega$ be a permutation of $\Omega$ chosen uniformly at random, and let $M(A)=MIN_s(\pi(S(A,w)))$ and $L(A)=MOD_m(g(\pi(S(A,w))))$. Then:
1. ​
$$\widehat r_w(A,B) = \frac{\vert MIN_s(M(A) \cup M(B)) \cap M(A) \cap M(B) \vert}{\vert MIN_s(M(A) \cup M(B)) \vert}$$
is an unbiased estimate of the resemblance of $A$ and $B$.

2.
$$\widehat r_w(A,B) = \frac{\vert L(A) \cap L(B) \vert}{\vert L(A) \cup L(B) \vert}$$
is an unbiased estimate of the resemblance of $A$ and $B$.

3.
$$\widehat c_w(A,B) = \frac{\vert L(A) \cap L(B) \vert}{\vert L(A) \vert}$$
is an unbiased estimate of the containment of $A$ and $B$.
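A rough sketch of the $MIN_s$ estimator from part 1, using a random permutation over a small integer universe to stand in for $\pi$ (names and parameters are illustrative):

```python
# Sketch of the MIN_s resemblance estimator: permute both sets, keep the
# s smallest images, and compare sketches.
import random

def min_s(W, s):
    """MIN_s(W): the s smallest elements of W (all of W if |W| < s)."""
    return set(sorted(W)[:s]) if len(W) >= s else set(W)

def resemblance_estimate(SA, SB, s=100, universe=1 << 16, seed=0):
    rng = random.Random(seed)
    perm = list(range(universe))
    rng.shuffle(perm)                      # a random permutation pi
    MA = min_s({perm[x] for x in SA}, s)   # M(A) = MIN_s(pi(S(A)))
    MB = min_s({perm[x] for x in SB}, s)
    union_sketch = min_s(MA | MB, s)       # MIN_s(M(A) ∪ M(B))
    return len(union_sketch & MA & MB) / len(union_sketch)

SA, SB = set(range(0, 1000)), set(range(500, 1500))  # true resemblance = 1/3
print(round(resemblance_estimate(SA, SB), 3))
```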

Implementation Issues

Instead of keeping each shingle as is, which is costly in storage, we first associate to each shingle a (shorter) id of $l$ bits, and then use a random permutation $\pi$ of the set $\{0, \ldots, 2^l -1 \}$. Fix $\pi$ and let it be $f$:

$f: \Omega \to \{0, \ldots, 2^l -1 \}$

Then the estimated resemblance is
$$r_{w,f}(A,B) = \frac{\vert f(S(A,w)) \cap f(S(B,w)) \vert}{\vert f(S(A,w) )\cup f(S(B,w)) \vert}$$

In implementation, Rabin fingerprints are used because their probability of collision is well understood and can be computed very efficiently in software.
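As a stand-in for Rabin fingerprints, the sketch below maps each shingle to an $l$-bit id with a generic hash; a real implementation would use Rabin fingerprints for their collision guarantees:

```python
# Map each shingle to an l-bit id. A generic keyed hash (blake2b) stands in
# here for the Rabin fingerprints used in actual implementations.
import hashlib

def shingle_id(shingle, l=32):
    """Return an l-bit id for a shingle (a tuple of tokens)."""
    data = " ".join(shingle).encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") % (1 << l)

ids = {shingle_id(s) for s in [("a", "rose", "is", "a"),
                               ("rose", "is", "a", "rose")]}
print(ids)  # ids for the two shingles
```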

Similarity Estimation Techniques from Rounding Algorithms

2002 Moses S. Charikar

Definition of Locality Sensitive Hashing

A locality sensitive hashing scheme is a distribution on a family $\mathcal F$ of hash functions operating on a collection of objects, such that for two objects $x$, $y$,
$$\mathbf {Pr}_{h \in \mathcal F}[h(x)=h(y)] = sim(x, y)$$
where $sim(x, y) \in [0, 1]$ is some similarity function defined on the collection of objects.

The paper proves properties of certain similarity functions that lead to the existence of a locality-sensitive hashing scheme.
For example, for Jaccard coefficient of similarity
$$sim(A,B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert}$$
min-wise independent permutations scheme allows the construction of a distribution on hash functions $h: 2^U \to U$ such that
$$\mathbf {Pr}_{h \in \mathcal F}[h(x)=h(y)] = sim(x, y)$$
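The defining property can be checked empirically: with a truly random permutation, the probability that two sets share the same minimum element is exactly their Jaccard coefficient. A small simulation (universe and sets chosen for illustration):

```python
# Empirical check: a random-permutation min-hash collides with probability
# close to the Jaccard coefficient of the two sets.
import random

def jaccard(A, B):
    return len(A & B) / len(A | B)

A, B = set(range(0, 60)), set(range(30, 90))  # Jaccard = 30/90 = 1/3
rng = random.Random(42)
universe = list(range(90))
trials, collisions = 2000, 0
for _ in range(trials):
    rng.shuffle(universe)                       # a fresh random permutation
    rank = {x: i for i, x in enumerate(universe)}
    h = lambda S: min(S, key=rank.__getitem__)  # min-hash under this permutation
    collisions += (h(A) == h(B))
print(collisions / trials)  # close to jaccard(A, B) = 1/3
```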

Random Hyperplane Based Hash Functions for Vectors

Given a collection of vectors in $R^d$, we consider the family of hash functions defined as follows: We choose a random vector $\vec r$ from the $d$-dimensional Gaussian distribution. Corresponding to this vector $\vec r$, we define a hash function $h_{\vec r}$ as follows:
$$h_{\vec r}(\vec u) =\begin{cases} 1 & \text{if } \vec r \cdot \vec u \ge 0 \\ 0 & \text{if } \vec r \cdot \vec u \lt 0 \end{cases}$$

Then for vectors $\vec u$ and $\vec v$:

$$\mathbf {Pr}_{h \in \mathcal F}[h(\vec u)=h(\vec v)] = 1 - \frac{\theta(\vec u, \vec v)}{\pi}$$
where $\theta(\vec u, \vec v)$ is the angle between the two vectors. In particular, when $\vec u$ and $\vec v$ are the characteristic vectors of sets $A$ and $B$,
$$\theta = \cos^{-1}\left(\frac{\vert A \cap B\vert}{\sqrt {\vert A \vert \cdot \vert B \vert }} \right)$$
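A sketch of the random-hyperplane hash; the fraction of differing sign bits across many hyperplanes estimates $\theta / \pi$ (dimensions and counts are illustrative):

```python
# Random-hyperplane (sim-hash) sketch: each bit is the sign of the dot
# product with a Gaussian random vector shared by all inputs.
import random

def random_hyperplanes(d, n_bits, seed=0):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_bits)]

def simhash(u, planes):
    return [1 if sum(r * x for r, x in zip(p, u)) >= 0 else 0 for p in planes]

planes = random_hyperplanes(2, 5000)
u, v = [1.0, 0.0], [1.0, 1.0]  # angle = pi/4, so Pr[bits differ] = 1/4
bu, bv = simhash(u, planes), simhash(v, planes)
frac_diff = sum(a != b for a, b in zip(bu, bv)) / len(planes)
print(frac_diff)  # approx theta/pi = 0.25
```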

The Earth Mover Distance and Rounding Scheme

Earth Mover Distance. Given a set of points $L=\{l_1, \ldots, l_n\}$ with a distance function $d(i, j)$ defined on them, a probability distribution $P(X)$ is a set of weights $p_1, \ldots, p_n$ on the points such that $p_i \ge 0$ and $\sum_i p_i = 1$. The Earth Mover Distance $\mathbf {EMD}(P,Q)$ between two distributions $P$ and $Q$ is defined as the cost of the min-cost matching that transforms one distribution into the other.

Theorem 1. The Kleinberg Tardos rounding scheme yields a locality sensitive hashing scheme such that

$$\mathbf {EMD}(P,Q) \le \mathbf E[d(h(P), h(Q))] \le O(\log n \log \log n) \mathbf {EMD}(P,Q)$$

The Kleinberg-Tardos rounding algorithm can be viewed as a generalization of min-wise independent permutations extended to a continuous setting. Their rounding procedure yields a locality-sensitive hash function for vectors whose coordinates are all non-negative. Given two vectors $\vec a=(a_1, \cdots, a_n)$ and $\vec b=(b_1, \cdots, b_n)$, the similarity function is
$$sim(\vec a, \vec b) = \frac{\sum_i \min(a_i, b_i)}{\sum_i \max(a_i, b_i)}$$
Note that when $\vec a$ and $\vec b$ are the characteristic vectors for sets $A$ and $B$, this expression reduces to the set similarity measure for min-wise independent permutations.
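The min/max similarity is easy to compute directly, and the reduction to set Jaccard for 0/1 characteristic vectors can be checked in a couple of lines:

```python
# Min/max (weighted Jaccard) similarity for non-negative vectors.
def minmax_sim(a, b):
    return sum(map(min, a, b)) / sum(map(max, a, b))

print(minmax_sim([2.0, 0.0, 1.0], [1.0, 1.0, 1.0]))  # (1+0+1)/(2+1+1) = 0.5

# Characteristic vectors of A = {0, 2} and B = {0, 1}: Jaccard = 1/3.
print(minmax_sim([1, 0, 1], [1, 1, 0]))  # 1/3
```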

Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms

2006 Monika Henzinger

Google ran an experiment comparing the two algorithms on a very large scale, namely on a set of 1.6B distinct web pages. The results show that Charikar’s algorithm finds more near-duplicate pairs on different sites and achieves better precision overall, namely 0.50 versus 0.38 for Broder et al.’s algorithm, but neither algorithm works well for finding near-duplicate pairs on the same site.

Further, they gave a combined algorithm: first compute all B-similar pairs, then filter out those pairs whose C-similarity falls below a certain threshold.

Min-Wise Independent Permutations

2000 Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher

Definition

We say that $\mathcal F \subseteq S_n$ is min-wise independent if for any set $X \subseteq [n]$ and any $x \in X$, when $\pi$ is chosen at random in $\mathcal F$ we have

$$\mathbf {Pr}(\min(\pi(X)) = \pi(x)) = \frac{1}{\vert X \vert}$$

In other words we require that all the elements of any fixed set X have an equal chance to become the minimum element of the image of $X$ under $\pi$.

Approximation

We say that $\mathcal F \subseteq S_n$ is approximately min-wise independent with relative error $\epsilon$ if for any set $X \subseteq [n]$ and any $x \in X$, when $\pi$ is chosen at random in $\mathcal F$ we have
$$\left\vert \mathbf {Pr}(\min(\pi(X)) = \pi(x)) - \frac{1}{\vert X \vert} \right\vert \le \frac{\epsilon}{\vert X \vert}$$
In other words, we require that all the elements of any fixed set $X$ have an almost equal chance to become the minimum element of the image of $X$ under $\pi$. The expected relative error made in evaluating resemblance using approximately min-wise independent families is less than $\epsilon$.
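The defining property is easy to check empirically for truly random permutations, which are trivially min-wise independent: every element of a fixed $X$ should become the minimum of $\pi(X)$ about equally often (parameters below are arbitrary):

```python
# Empirical check of min-wise independence for truly random permutations:
# each element of X is the minimum of pi(X) with probability 1/|X|.
import random
from collections import Counter

X = [3, 7, 11, 19]
counts = Counter()
rng = random.Random(1)
n = 40000
items = list(range(32))  # the universe [n] being permuted
for _ in range(n):
    rng.shuffle(items)
    rank = {v: i for i, v in enumerate(items)}
    counts[min(X, key=rank.__getitem__)] += 1
for x in X:
    print(x, counts[x] / n)  # each close to 1/|X| = 0.25
```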

Detecting Near-Duplicates for Web Crawling

2007 Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma

Hamming Distance Problem

Given a collection of f-bit fingerprints and a query fingerprint $\mathcal F$, identify whether an existing fingerprint differs from $\mathcal F$ in at most $k$ bits.

Algorithm

Build $t$ tables $T_1, T_2, \ldots, T_t$. Each table $T_i$ stores a copy of all fingerprints with their bits reordered so that a chosen set of bit-positions is moved to the front and serves as the table key, while the remaining bits form the value. For example, if bit-positions $t_1, \ldots, t_p$ are chosen from a fingerprint $b_1, b_2, \ldots, b_f$:

$$b_1, b_2, \ldots, b_{t_1}, \ldots, b_{t_p}, \ldots, b_f \to$$
$$\text{key in table: } b_{t_1}, \ldots, b_{t_p}; \quad \text{value in table: the remaining bits in their original order}$$

So associated with table $T_i$ are two quantities: an integer $p_i$ (the bit length of the table key) and a permutation $\pi_i$ over the $f$ bit-positions.

Given fingerprint $\mathcal F$ and an integer $k$, we can now probe these tables in parallel:
Step 1: Identify all permuted fingerprints in $T_i$ whose top $p_i$ bit-positions match the top $p_i$ bit-positions of $\pi_i(\mathcal F)$.
Step 2: For each of the permuted fingerprints identified in Step 1, check if it differs from $\pi_i(\mathcal F)$ in at most $k$ bit-positions.

If the collection contains $2^d$ fingerprints, we expect $2^{d-p_i}$ fingerprints to match the top $p_i$ bits of a given (permuted) fingerprint.
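A toy version of the table-probing idea, with $f = 16$, four rotations standing in for the permutations, and $p_i = 4$ (parameters are illustrative, far smaller than the paper's):

```python
# Toy Hamming-distance tables: f = 16 bits split into 4 blocks of 4.
# With k = 3 differing bits, at least one block is untouched, so four
# rotated tables keyed on the top 4 bits find every near-duplicate.
def hamming(a, b):
    return bin(a ^ b).count("1")

def rotl(x, r, f=16):
    return ((x << r) | (x >> (f - r))) & ((1 << f) - 1) if r else x

F, P, K = 16, 4, 3
ROTATIONS = [0, 4, 8, 12]  # bring each 4-bit block to the front in turn

def build_tables(fingerprints):
    tables = []
    for r in ROTATIONS:
        t = {}
        for fp in fingerprints:
            t.setdefault(rotl(fp, r, F) >> (F - P), []).append(fp)
        tables.append(t)
    return tables

def near_duplicates(tables, q):
    out = set()
    for t, r in zip(tables, ROTATIONS):
        # Step 1: key match on the top P permuted bits; Step 2: verify distance.
        for fp in t.get(rotl(q, r, F) >> (F - P), []):
            if hamming(fp, q) <= K:
                out.add(fp)
    return out

fps = [0b1010101010101010, 0b0000111100001111]
tables = build_tables(fps)
print(near_duplicates(tables, 0b1010101010101110))  # finds the first fingerprint
```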

Example

Consider $f = 64$ (64-bit fingerprints) and $k = 3$, so near-duplicates’ fingerprints differ in at most 3 bit-positions. Assume we have 8B $= 2^{34}$ existing fingerprints, i.e. $d = 34$.

We split the 64 bits into 4 blocks of 16 bits each. There are ${4 \choose 1}= 4$ ways of choosing 1 of these 4 blocks (the 3 differing bits can then lie only in the other 3 blocks). For each such choice, we divide the remaining 48 bits into four blocks of 12 bits each; again there are ${4 \choose 1} = 4$ ways of choosing 1 of these 4 blocks. The permutation for a table places the bits of the chosen blocks in the leading positions. The value of $p_i$ is $28 = 16+12$ for all $4 \times 4 = 16$ tables. On average, a probe retrieves $2^{34-28} = 64$ (permuted) fingerprints.

Ceph Distributed Storage

Ceph is a distributed file system pioneered by Sage A. Weil which maximizes distributed capability of file storage in three aspects