
Bloom Filter

A Bloom filter is a probabilistic set data structure used to test whether an element is a member of a set. It offers constant-time adds and queries and uses far less space than traditional set data structures, at the cost of sacrificing correctness: false positive matches are possible, so a query returns either “possibly in the set” or “definitely not in the set”. A Bloom filter uses k predetermined hash functions, each of which maps every added item to a position in an underlying bit array. In effect each element is sampled, and precision is lost in the process. Naturally, only the add operation is supported, not remove. Not surprisingly, the more elements are added to the set, the larger the probability of false positives.

Picking k independent, uniformly distributed hash functions is non-trivial. The simplified Java implementation described in this post adopts a double-hashing strategy; code available @ my github Bloom Filter.

Elements of Implementation

  • The actual data structure representing a Bloom filter is a bit array, or in Java, a BitSet of length bitSize.
  • k distinct hash values are obtained for each item using double hashing:
    `h_i(key) = h_1(key) + i*h_2(key)`

Probabilistic relations

  • Optimal number of hash functions (from Wikipedia) for a given number of entries and bit-array size; this choice minimizes the false positive probability
    `k=(bitSize)/(estimatedEntries)*ln2`
  • Optimal bit-array size for a given number of entries and a target false positive probability
    `bitSize=-(estimatedEntries*ln(fpp))/(ln2)^2`
  • Expected false positive probability for a given number of entries and bit-array size, using the corresponding optimal number of hash functions k (a small sizing sketch in Java follows this list)
    `E|fpp| ~~ (1-(1-1/(bitSize))^(k*estimatedEntries))^k`
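
These formulas translate directly into code. The sketch below shows what helpers like the optimalNumOfHashFunctions used in the constructor further down might look like; the names and exact signatures are assumptions, not the repository code.

    // Illustrative sizing helpers derived from the formulas above;
    // names and signatures are assumptions, not the exact repository code.
    static int optimalNumOfHashFunctions(long estimatedEntries, long bitSize) {
        // k = (bitSize / estimatedEntries) * ln 2
        return Math.max(1, (int) Math.round((double) bitSize / estimatedEntries * Math.log(2)));
    }

    static int optimalBitSize(long estimatedEntries, double fpp) {
        // bitSize = -(estimatedEntries * ln(fpp)) / (ln 2)^2
        return (int) Math.ceil(-estimatedEntries * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

For example, 1000 expected entries with a 1% false positive target give a bit size of roughly 9,600 bits and k = 7.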

The GitHub project also includes a test that measures the actual false positive rate of this implementation.

Core Code Illustrated

    public BloomFilter(int expectedEntries, int byteSize) {
        if (!(expectedEntries > 0)) {
            throw new IllegalArgumentException("expectedEntries should be > 0");
        }
        this.expectedEntries = expectedEntries;
        this.numHashFunctions = optimalNumOfHashFunctions(expectedEntries, byteSize << 3);
        this.bitSet = new BitSet(byteSize * 8);
    }

    public void addHash(int hash32) {
        int hash1 = hash32;
        int hash2 = hash32 >>> 16;

        for (int i = 1; i <= numHashFunctions; i++) {
            int combinedHash = hash1 + (i * hash2);
            // hashcode should be positive, flip all the bits if it's negative
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;
            }
            int pos = combinedHash % bitSet.size();
            bitSet.set(pos);
        }
    }

    public boolean testHash(int hash32) {
        int hash1 = hash32;
        int hash2 = hash32 >>> 16;

        for (int i = 1; i <= numHashFunctions; i++) {
            int combinedHash = hash1 + (i * hash2);
            // hashcode should be positive, flip all the bits if it's negative
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;
            }
            int pos = combinedHash % bitSet.size();
            if (!bitSet.get(pos)) {
                return false;
            }
        }
        return true;
    }
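
A tiny usage sketch, assuming the constructor and methods above; String.hashCode() stands in for the 32-bit hash here, and the repository may use a different hash.

    // Illustrative usage only.
    BloomFilter filter = new BloomFilter(1000, 1200);         // ~1000 entries in 1200 bytes
    filter.addHash("alice".hashCode());
    filter.addHash("bob".hashCode());

    System.out.println(filter.testHash("alice".hashCode()));  // true: "possibly in set"
    System.out.println(filter.testHash("carol".hashCode()));  // usually false: "definitely not in set"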


Union-Find

The union-find is a classic data structure used in many algorithms, for instance cycle detection in an undirected graph and Tarjan's offline lowest common ancestor. The code illustrated in this post passes AOJ Disjoint Set: Union Find Tree and is available at Github AlgImpl.

The Problem

Given n elements, initially each element is in a different set, {1}, {2}, …, {n}.
Two operations are supported on the sets by union-find, which are union and find.

  • A union operation combines two sets into one.
  • A find operation identifies the set that contains a particular element.

An intermixed sequence of union and find operations is performed. We need to devise the most efficient algorithm for these union and find operations.

Set Representation

An int array id is a parent-link representation of a forest of trees. A concrete example is shown below.

idx 0 1 2 3 4 5 6 7 8 9
parent 0 1 1 8 3 0 5 7 8 8

(Figure: the forest of trees corresponding to the array above.)
Thus, each set can be represented by the root of its tree, and this root is what find returns.

Optimizations

Two optimizations are applied to boost performance.

  1. Weighted union: always attach the lighter (smaller) tree under the root of the heavier one.
  2. Path compression: flatten the tree along the search path during find.

These two tricks make the amortized time complexity of find and union almost O(1), or more precisely, a bound involving the inverse of Ackermann's function.

Full Java Code

public class UnionFind {
    int[] id;
    int count;
    int[] weight;  // size indexed by root id

    public UnionFind(int n) {
        id = new int[n];
        weight = new int[n];
        count = n;
        for (int idx = 0; idx < id.length; idx++) {
            id[idx] = idx;
            weight[idx] = 1;
        }
    }

    public void union(int p, int q) {
        int pRoot = find(p);
        int qRoot = find(q);
        if (pRoot == qRoot) {
            return;
        }
        // make smaller root point to larger one
        if (weight[pRoot] < weight[qRoot]) {
            id[pRoot] = qRoot;
            weight[qRoot] += weight[pRoot];
        } else {
            id[qRoot] = pRoot;
            weight[pRoot] += weight[qRoot];
        }
        count--;
    }

    // path compression
    public int find(int p) {
        if (id[p] != p) {
            id[p] = find(id[p]);
        }
        return id[p];
    }

    public boolean connected(int p, int q) {
        return find(p) == find(q);
    }

    public int count() {
        return count;
    }
}
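
For reference, a quick usage sketch of the class above (the unions below are made up for illustration):

UnionFind uf = new UnionFind(10);
uf.union(1, 2);
uf.union(3, 8);
uf.union(8, 9);

System.out.println(uf.connected(2, 1)); // true
System.out.println(uf.connected(3, 9)); // true, via 8
System.out.println(uf.connected(0, 5)); // false
System.out.println(uf.count());         // 7: 10 elements minus 3 successful unions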


Dijkstra Single Source Shortest Path

Dijkstra's single-source shortest path algorithm in adjacency-list form, which passes the AOJ Single Source Shortest Path Problem, is available @Github

The algorithm works like Prim's algorithm for generating a minimum spanning tree, in that each round adds one vertex to the tree. The rule for choosing the next vertex is minimal total weight from the source vertex.

Pseudo algorithm

  • distTo[] holds the total weight (distance) of each vertex from the source vertex.
  • minHeap is a min binary heap keeping track of the total weights of all vertices not yet in the spanning tree.
distTo[] initialized to INFINITE
create the minHeap
minHeap.add(source vertex)
while(!minHeap.isEmpty) {
    Vertex minVertex = minHeap.removeMin
    foreach edge incident to minVertex {
        if (distTo[minVertex] + edge.weight < distTo[edge.destVertex]) {
            update distTo[edge.destVertex]
            update the minHeap entry of edge.destVertex
        }
    }
}

Technicalities

PriorityQueue in java.util is used as the binary heap, but it lacks an update(key, newValue) API. We wrap a PriorityQueue to mimic that behavior by first removing the object, if it is already in the queue, and then adding it again.
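
A minimal sketch of what such a wrapper might look like follows; the actual MinPriorityQ class in the repository may differ, in particular in how element identity is determined for removal.

import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch only: mimics update(key, newValue) by removing and re-adding.
public class MinPriorityQ<T> {
    private final PriorityQueue<T> queue;

    public MinPriorityQ(Comparator<T> comparator) {
        this.queue = new PriorityQueue<>(comparator);
    }

    public void add(T item) { queue.add(item); }

    public T remove() { return queue.remove(); }

    public boolean isEmpty() { return queue.isEmpty(); }

    public void addOrUpdateKey(T item) {
        queue.remove(item); // linear scan; no-op if the item is not present
        queue.add(item);    // re-insert with its (possibly updated) priority
    }
}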

Complexity Analysis

The time complexity of Dijkstra is O(E*log(V)): at most E edges are relaxed, each triggering a heap operation, and each heap operation costs O(log V) since the heap holds at most V vertices.

Core Java code

public static class DijkstraShortestPath {

    Integer[] distTo; // index is vertex
    Integer[] edgeTo;

    public DijkstraShortestPath(WeightedDigraph graph, int sourceVertex) {
        distTo = new Integer[graph.numVertex];
        edgeTo = new Integer[graph.numVertex];

        distTo[sourceVertex] = 0;
        edgeTo[sourceVertex] = sourceVertex;
        MinPriorityQ<PathWithV> minPQ = new MinPriorityQ<PathWithV>(new Comparator<PathWithV>() {
            @Override
            public int compare(PathWithV o1, PathWithV o2) {
                return o1.path - o2.path;
            }
        });
        PathWithV p = new PathWithV(sourceVertex, 0);
        minPQ.add(p);
        while (!minPQ.isEmpty()) {
            int vertex = minPQ.remove().vertex;
            for (WeightedDigraph.Edge edge : graph.srcEdges[vertex]) {
                Integer targetVertexPath = distTo[edge.targetVertex];
                if (targetVertexPath == null || distTo[vertex] + edge.weight < targetVertexPath) {
                    distTo[edge.targetVertex] = distTo[vertex] + edge.weight;
                    edgeTo[edge.targetVertex] = vertex;
                    p = new PathWithV(edge.targetVertex, distTo[edge.targetVertex]);
                    minPQ.addOrUpdateKey(p);
                }
            }
        }
    }
}


Apriori (Java)

Algorithm Origin: Coursera: Mining of Massive Datasets
Code at GitHub Apriori

Apriori is an algorithm for frequent item set mining. For example, consider the input below, where each line is one item set.

1 2 4
1 3 4
1 4 5 6
1 2 3 4
2 3
4 5 6
7

The data above has 7 item sets, the first one containing items 1, 2 and 4. Suppose we want to find all item sets that appear at least twice (support >= 2). Apriori offers an iterative approach: first it computes all single items appearing at least twice, and in each later round it computes frequent sets one item longer than the current ones.
For the example above, the output of each round is listed below.

Round 1
1,
2,
3,
4,
5,
6,
Round 2
1, 2,
1, 3,
1, 4,
2, 3,
2, 4,
3, 4,
4, 5,
4, 6,
5, 6,
Round 3
1, 2, 4,
1, 3, 4,
4, 5, 6,

In round 3, three item sets, (1, 2, 4), (1, 3, 4) and (4, 5, 6), are found to appear at least twice in the input data, which can be confirmed manually.
The core of each Apriori iteration is the join and filter steps. I implemented them without optimization, but the code sketches their spirit.
Let me dive into the join algorithm a little.

Join takes a set of frequent tuples of length n. It considers all pairs of tuples; say tuples A and B are under consideration: if they differ in only one item, a tuple C of length n+1 is constructed that contains all elements of A and B. For instance, with tuple A = (1, 3, 4) and tuple B = (2, 3, 4), A and B differ in only one element, so a tuple C of length 4 can be constructed: C = (1, 2, 3, 4). However, the candidate tuple C must meet another criterion: all of its length-n subsets must already be in the current frequent set. That is, (1, 2, 3), (1, 2, 4), (1, 3, 4) and (2, 3, 4) must all be in the current frequent set. A sketch of this step follows.
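
Here is a rough sketch of that join step, assuming tuples are kept as sorted lists of item ids; it is illustrative only and not the exact code in the repository.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Join frequent length-n tuples into candidate length-(n+1) tuples.
static Set<List<Integer>> join(Set<List<Integer>> frequent) {
    Set<List<Integer>> candidates = new HashSet<>();
    for (List<Integer> a : frequent) {
        for (List<Integer> b : frequent) {
            TreeSet<Integer> union = new TreeSet<>(a);
            union.addAll(b);
            if (union.size() != a.size() + 1) {
                continue; // a and b must differ in exactly one item
            }
            List<Integer> candidate = new ArrayList<>(union);
            // prune: every length-n subset of the candidate must be frequent
            boolean keep = true;
            for (int skip = 0; skip < candidate.size(); skip++) {
                List<Integer> subset = new ArrayList<>(candidate);
                subset.remove(skip);
                if (!frequent.contains(subset)) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                candidates.add(candidate);
            }
        }
    }
    return candidates;
}

The filter step then counts each surviving candidate against the input item sets and keeps those meeting the support threshold.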

TF-IDF (Java)

Algorithm Origin: Coursera: Mining of Massive Datasets
Code at GitHub TF-IDF

TF-IDF Concept

TF-IDF, short for term frequency–inverse document frequency, is a statistic intended to reflect how important a word is to a document in a collection. Based on TF-IDF, it is convenient to compute document similarity or to do automated document summarization.

The intuition is that

  1. The more often a word appears in a document, the more important it is relative to other words. This is the term frequency part of TF-IDF.
  2. But not all words are created equal. Each word is weighted by its inverse document frequency, which gives common words a low weight.

TF-IDF Formula

There are several variants of the TF and IDF formulas in Wikipedia TF-IDF. In this article and the Java code, we adopt
`tf(t,d) = f_(t,d) / |d|`, where `f_(t,d)` is the number of occurrences of term `t` in doc `d` and `|d|` is the total number of terms in doc `d`.
`idf(t, D) = log(1 + N / n_t)`, where `N` is the number of docs and `n_t` is the number of docs containing term `t`.
`tfidf(t,d,D) = tf(t,d) * idf(t,D)`
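
For example, if a term occurs 3 times in a 7-term document and appears in 1 of the 2 documents in the collection, then `tf = 3/7 ~~ 0.43`, `idf = log(1 + 2/1) ~~ 0.48` (using the base-10 logarithm, as the code below does), and `tfidf ~~ 0.20`.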

Justification of IDF

The probability that a given document `d` contains a term `t` is
`P(t|d)=|{d in D:t in d}|/N`
So IDF is defined as
`idf = -log P(t|d) = log(N/|{d in D:t in d}|)`

Code

public class TF_IDF {

    int numOfWords;
    double[] idfVector;
    double[][] tfIdfMatrix;
    double[][] tfMatrix;
    String[] wordVector;
    int docLength[];

    public TF_IDF(String[] docs) {
        // STEP 1, scan all words and count the number of different words
        // mapWordToIdx maps word to its vector index
        HashMap<String, Integer> mapWordToIdx = new HashMap<>();
        int nextIdx = 0;
        for (String doc : docs) {
            for (String word : doc.split(" ")) {
                if (!mapWordToIdx.containsKey(word)) {
                    mapWordToIdx.put(word, nextIdx);
                    nextIdx++;
                }
            }
        }

        numOfWords = mapWordToIdx.size();

        // STEP 2, create word vector where wordVector[i] is the actual word
        wordVector = new String[numOfWords];
        for (String word : mapWordToIdx.keySet()) {
            int wordIdx = mapWordToIdx.get(word);
            wordVector[wordIdx] = word;
        }

        // STEP 3, create doc word vector where docCountVector[i] is number of
        // docs containing word of index i
        // and doc length vector
        int[] docCountVector = new int[numOfWords];
        docLength = new int[docs.length];
        // lastDocWordVector is an auxiliary vector keeping track of the last doc index
        // containing the word
        int[] lastDocWordVector = new int[numOfWords];
        for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
            lastDocWordVector[wordIdx] = -1;
        }
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            String doc = docs[docIdx];
            String[] words = doc.split(" ");
            for (String word : words) {
                docLength[docIdx] = words.length;
                int wordIdx = mapWordToIdx.get(word);
                if (lastDocWordVector[wordIdx] < docIdx) {
                    lastDocWordVector[wordIdx] = docIdx;
                    docCountVector[wordIdx]++;
                }
            }
        }

        // STEP 4, compute IDF vector based on docCountVector
        idfVector = new double[numOfWords];
        for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
            idfVector[wordIdx] = Math.log10(1 + (double) docs.length / (docCountVector[wordIdx]));
        }

        // STEP 5, compute term frequency matrix, tfMatrix[docIdx][wordIdx]
        tfMatrix = new double[docs.length][];
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            tfMatrix[docIdx] = new double[numOfWords];
        }
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            String doc = docs[docIdx];
            for (String word : doc.split(" ")) {
                int wordIdx = mapWordToIdx.get(word);
                tfMatrix[docIdx][wordIdx] = tfMatrix[docIdx][wordIdx] + 1;
            }
        }
        // normalize tfMatrix by dividing by the corresponding doc length
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
                tfMatrix[docIdx][wordIdx] = tfMatrix[docIdx][wordIdx] / docLength[docIdx];
            }
        }

        // STEP 6, compute final TF-IDF matrix
        // tfIdfMatrix[docIdx][wordIdx] = tfMatrix[docIdx][wordIdx] * idfVector[wordIdx]
        tfIdfMatrix = new double[docs.length][];
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            tfIdfMatrix[docIdx] = new double[numOfWords];
        }

        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
                tfIdfMatrix[docIdx][wordIdx] = tfMatrix[docIdx][wordIdx] * idfVector[wordIdx];
            }
        }

    }

    public double[][] getTF_IDFMatrix() {
        return tfIdfMatrix;
    }

    public String[] getWordVector() {
        return wordVector;
    }

}

The test case is taken from Wikipedia TF-IDF, where
d1 = “this is a a sample” and d2 = “this is another another example example example”.
The TF-IDF values for d2, sorted in descending order, are
[example, another, this, is]
[0.20448053773699817, 0.13632035849133212, 0.043004285094854454, 0.043004285094854454]
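
A sketch of how the class can be driven for that case (illustrative; the actual test in the repository may be organized differently):

String[] docs = {
    "this is a a sample",
    "this is another another example example example"
};
TF_IDF tfIdf = new TF_IDF(docs);
double[][] matrix = tfIdf.getTF_IDFMatrix();
String[] words = tfIdf.getWordVector();

// print the TF-IDF weight of every word for d2 (document index 1)
for (int w = 0; w < words.length; w++) {
    System.out.println(words[w] + " -> " + matrix[1][w]);
}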

Minhashing (Java)

Algorithm Origin: Coursera: Mining of Massive Datasets week 2
Code is at GitHub Minhashing

Jaccard Similarity of Two Sets

Let `A` and `B` be two sets; the Jaccard similarity is
`J(A,B) = |A nn B|/|A uu B|`

For example, consider the 4 sets illustrated below.

Element S1 S2 S3 S4
a 1 0 0 1
b 0 0 1 0
c 0 1 0 1
d 1 0 1 1
e 0 0 1 0

`J(S_1, S_3) = |{d}|/|{a, b, d, e}| = 1/4`

Minhashing and Jaccard Similarity

Because the universe of possible items can run into the millions, computing exact Jaccard similarity for every pair of sets would be a huge effort. If there were a much shorter signature vector for each set, estimating Jaccard similarity would become much cheaper.
We select a random permutation of the items and define the minhash of a set under that permutation to be the first item, in permutation order, that is present in the set. For the example above, if a randomly generated permutation h orders the items as b, e, a, d, c, then `h(S_1)=a`.

Why Minhashing Estimates Jaccard Similarity

The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
How to interpret this? Let's enumerate all permutations for `S_2` and `S_4` in the example above. The items involved in `S_2` or `S_4` are a, c and d.

Element S2 S4
a 0 1
c 1 1
d 0 1

All permutations are

Permutation h `h(S_2) = h(S_4)` ?
adc false
acd false
cad true
cda true
dac false
dca false

`P(h(S_2)=h(S_4))` is 1/3, which is also Jaccard similarity.

Computing Minhash Signatures from Multiple Hash Functions

The actual minhash signature is computed using some approximations.
1. Sample only several permutations rather than enumerating all of them.
2. Use a hash function to simulate each permutation. The hash function may map some pairs of integers to the same bucket and
leave other buckets unfilled. However, the difference is unimportant as long as there are not too many collisions. (This is what the implementation below does, although I have not worked out the mathematical justification.)

public Minhashing(int[][] sets, Function<Integer, Integer>[] hashs) {
    int setNumber = sets.length;
    // use Integer[] so that infinity can be represented by null
    sig = new Integer[setNumber][];
    for (int i = 0; i < setNumber; i++) {
        sig[i] = new Integer[hashs.length];
    }

    // for each row index value
    for (int i = 0; i < sets[0].length; i++) {
        // calc hash vector 
        int[] hashVector = new int[hashs.length];
        for (int hashIdx = 0; hashIdx < hashs.length; hashIdx++) {
            hashVector[hashIdx] = hashs[hashIdx].apply(i);
        }

        // for each set signature vector
        for (int setIdx = 0; setIdx < setNumber; setIdx++) {
            for (int hashIdx = 0; hashIdx < hashs.length; hashIdx++) {
                if (sets[setIdx][i] != 0) {
                    if (sig[setIdx][hashIdx] == null || sig[setIdx][hashIdx] > hashVector[hashIdx]) {
                        sig[setIdx][hashIdx] = hashVector[hashIdx];
                    }
                }
            }
        }
    }
}

The running result of the example, with `h_1=(x+1) mod 5` and `h_2=(3*x+1) mod 5` is

S1 S2 S3 S4
h1 1 3 0 1
h2 0 2 0 0
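
For reference, the example can be reproduced roughly as follows inside a test method; this is a sketch that assumes the constructor above and that the signature matrix sig is accessible.

import java.util.function.Function;

// Characteristic matrix from the example: sets[setIdx][rowIdx], rows are a..e.
int[][] sets = {
    {1, 0, 0, 1, 0},  // S1 = {a, d}
    {0, 0, 1, 0, 0},  // S2 = {c}
    {0, 1, 0, 1, 1},  // S3 = {b, d, e}
    {1, 0, 1, 1, 0}   // S4 = {a, c, d}
};

Function<Integer, Integer> h1 = x -> (x + 1) % 5;
Function<Integer, Integer> h2 = x -> (3 * x + 1) % 5;
@SuppressWarnings("unchecked")
Function<Integer, Integer>[] hashs = new Function[] { h1, h2 };

Minhashing minhashing = new Minhashing(sets, hashs);
// minhashing.sig should now hold the signature values shown above,
// indexed as sig[setIdx][hashIdx].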

KMP (Java)

The KMP (Knuth-Morris-Pratt) string matching algorithm is implemented according to Wikipedia KMP. Code @ github

KMP Table

  1. T[i] is the number of characters immediately before position i (exclusive) that match the prefix W[0], W[1], …; in other words, the length of the longest proper prefix of W that is also a suffix of W[0..i-1].
  2. If T[i] > 0, then on a mismatch at W[i] the comparison can resume at W[T[i]].
  3. T[0] is always -1, marking the starting element.
  4. T[1] is always 0, because considering W[0] alone leaves no proper prefix to match.

For example, in the following W, at pos 6 the sequence AB ending at W[5] matches W[0], W[1], so T[6]=2.

i 0 1 2 3 4 5 6
W[i] A B C D A B D
T[i] -1 0 0 0 0 1 2

Another example is

i 0 1 2 3 4 5 6 7
W[i] A B C A B A B C
T[i] -1 0 0 0 1 2 1 2

Construct KMP Table

Code implemented according to Wikipedia pseudocode

  • pos: the index into T being computed; only W[pos-1] needs to be compared with W[cnd].
  • cnd: the candidate border length, meaning W[cnd-1], W[cnd-2], …, W[0] have already been matched against the characters ending at W[pos-2].
  • case 1: W[pos-1] == W[cnd]; the matched prefix grows, so increment cnd, record T[pos] = cnd, and advance pos.
  • case 2: W[pos-1] != W[cnd] but cnd > 0. Although W[pos-1] mismatches W[cnd], the prefix matched so far itself ends in a shorter border of length T[cnd], so fall back cnd to T[cnd] and try again.
  • case 3: cnd == 0 and W[pos-1] does not even match W[0]; T[pos] is therefore 0.

table = new int[strNeedle.length()];
table[0] = -1;
table[1] = 0;
int pos = 2;
int cnd = 0;
while (pos < strNeedle.length()) {
    if (strNeedle.charAt(pos - 1) == strNeedle.charAt(cnd)) {
        // case 1: the substring continues
        cnd++;
        table[pos] = cnd;
        pos++;
    } else if (cnd > 0) {
        // case 2: it doesn't, but we can fall back
        cnd = table[cnd];
    } else {
        // case 3: we have run out of candidates.  Note cnd = 0
        table[pos] = 0;
        pos++;
    }
}

Match Algorithm

Code implemented according to Wikipedia pseudocode

  • m: the starting index into strHayStack where a candidate match begins
  • i: the index into strNeedle; compare strNeedle[i] with strHayStack[m+i]
  • case 1: strNeedle[i] == strHayStack[m+i]; advance i, and return m once the whole needle has matched
  • case 2: strNeedle[i] != strHayStack[m+i] and table[i] > -1; reposition the start to m = m + i - table[i] and fall back i to table[i]
  • case 3: table[i] == -1 (i.e. i == 0); the very first character mismatched, so move m forward by one

public int match(String strHayStack) {
    int m = 0;
    int i = 0;
    while (m + i < strHayStack.length()) {
        if (strNeedle.charAt(i) == strHayStack.charAt(m + i)) {
            // case 1: the substring continues
            if (i == strNeedle.length() - 1) {
                return m;
            }
            i++;
        } else {
            if (table[i] > -1) {
                // case 2: fall back to shorter prefix
                m = m + i - table[i];
                i = table[i];
            } else {
                // case 3: mismatch at the very first character (table[i] == -1)
                i = 0;
                m = m + 1;
            }
        }
    }
    return -1;
}
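
Assuming the table-building and match code above live in a class whose constructor takes the needle and builds the table (the class name here is an assumption, not taken from the repository), usage looks roughly like:

// Hypothetical usage; class and constructor are assumed.
KMP kmp = new KMP("ABCDABD");                    // builds the table for the needle
int pos = kmp.match("ABC ABCDAB ABCDABCDABDE");  // classic example from Wikipedia
System.out.println(pos);                         // 15, the index where "ABCDABD" begins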

Batch Gradient Descent of Linear Regression (Python)

Originated from Coursera Machine Learning, week 2.
Code at https://github.com/MyEncyclopedia/AlgImpl/tree/master/02_Machine_Learning/Linear_Regression

Hypothesis

Vector `X=[1, x_1, x_2, x_3, …, x_n]` is the input data, with a leading 1 for the intercept term.

Linear hypothesis is

`h_theta(x) = theta_0 + theta_1*x_1 + theta_2*x_2 + … + theta_n*x_n`

Define cost function to be

`J(theta)=1/(2m)Sigma_(i=1)^m(h_theta(x^((i)))-y^((i)))^2`

The task is to find `theta` that minimizes `J`
The gradient descent method finds a converged `theta` iteratively.
The partial derivative of `J(theta)` with respect to `theta_j` is `1/m*Sigma_(i=1)^m(h_theta(x^((i)))-y^((i)))*x_(j)^((i))` (the factor of 2 cancels when differentiating the square).

Therefore, the update equation is
`theta_j = theta_j - alpha*1/m*Sigma_(i=1)^m(h_theta(x^((i)))-y^((i)))*x_(j)^((i))`

Feature Scaling

Get every feature into approximately [-1, 1] range.
`x_i = (x_i-mu_i)/sigma_i`, where `mu_i` is the mean of the `i`-th feature and `sigma_i` is its standard deviation.

Python Code

import numpy as np

class GradientDescent:
    "Batch Gradient Descent method of linear regression"
    alpha = 0.01
    step = 100

    def __init__(self, alpha=0.01, step=100):
        self.alpha = alpha
        self.step = step

    def distance(self, v1, v2):
        v = v1 - v2
        v = v * v
        return np.sum(v)


    def h(self, X, theta):
        n = len(theta)
        return np.dot(theta, X)


    def cost(self, X):
        c = np.sum(X*X)
        m = len(X)
        c = 1.0/(2*m) * c
        return c


    def featureNormalize(self, X):
        "Get every feature into approximately [-1, 1] range."
        featureN = len(X)
        mu = np.mean(X[0:featureN + 1], axis=1, dtype=np.float64)
        sigma = np.std(X[0:featureN + 1], axis=1, dtype=np.float64)
        X_norm = X
        for i in range(featureN):
            X_norm[i] = (X[i]-mu[i])/sigma[i]
        return X_norm, mu, sigma

    def iterate(self, data, toNormalize=False):
        m = len(data)
        data = np.transpose(data)
        featureN = len(data) - 1
        Y = data[featureN]  # shape 1 * m
        X = data[0:featureN]  # shape N * m
        if toNormalize:
            X, mu, sigma = self.featureNormalize(X)
        X = np.vstack((np.ones(m), X))  # shape (N+1) * m

        theta = np.zeros(featureN + 1, dtype=np.float64)

        for iter in range(self.step):
            theta_temp = np.zeros(featureN + 1, dtype=np.float64)
            V = self.h(X, theta)
            V = V - Y
            costValue = self.cost(V)

            for i in range(featureN+1):
                theta_temp[i] = np.sum(V * X[i])
            theta_temp *= self.alpha / m
            theta_temp = theta - theta_temp
            # dist = distance(theta, theta_temp)
            theta = theta_temp
        return theta

The result of running the code above with the test data (from the homework of the Machine Learning course) is
ex1data1.txt: `theta=[-3.63029144, 1.16636235]`
ex1data2.txt: `theta=[ 338658.2492493, 103322.82942954, -474.74249522]`

PageRank algorithm using matrix power iteration (Python)

Algorithm Origin: Coursera: Mining of Massive Datasets week 1
Code is at https://github.com/MyEncyclopedia/AlgImpl/tree/master/01_Mining_of_Massive_Datasets/PageRank

Power Iteration Method

  1. Init: `r^((0))=[1/N, …1/N]^T`
  2. Iterate: `r^((t+1))=M * r^((t))`
  3. Stop when `|r^((t+1)) – r^((t))| < epsilon`

where `M_(ji)=1/d_i` if `i` points to `j`, else `M_(ji)=0`; here `d_i` is the out-degree of node `i`

Naive Implementation

import numpy as np

class PageRank_NoTeleport:
    "Power iteration but does not address Spider trap problem or Dead end problem "
    epsilon = 0.0001

    def __init__(self, beta=0.85, epsilon=0.0001):
        self.epsilon = epsilon

    def distance(self, v1, v2):
        v = v1 - v2
        v = v * v
        return np.sum(v)

    def compute(self, G):
        "G is N*N matrix where if j links to i then G[i][j]==1, else G[i][j]==0"
        N = len(G)
        d = np.zeros(N)
        for i in range(N):
            for j in range(N):
                if (G[j, i] == 1):
                    d[i] += 1

        r0 = np.zeros(N, dtype=np.float32) + 1.0 / N
        # construct stochastic M
        M = np.zeros((N, N), dtype=np.float32)
        for i in range(N):
            for j in range(N):
                if G[j, i] == 1:
                    M[j, i] = 1.0 / d[i]
        while True:
            r1 = np.dot(M, r0)
            dist = self.distance(r1, r0)
            if dist < self.epsilon:
                break
            else:
                r0 = r1

        return r1

The naive version handles an ordinary column-stochastic matrix M. For example, for a 3-node graph encoded as
`G=((0,1,1),(1,1,0),(1,0,0)) => M=((0,0.5,1),(0.5,0.5,0),(0.5,0,0))`
The result is [0.40293884, 0.39887747, 0.1981837]

But it cannot address the dead-end problem (a node with no out-links) or the spider-trap problem. For example, for
`G=((0,1,0),(1,1,0),(1,0,0)) => M=((0,0.5,0),(0.5,0.5,0),(0.5,0,0))`
With the naive implementation we get a wrong result, [0.01896159, 0.03068034, 0.01171875], because rank mass leaks away at the dead end.

Teleport Version

The Google matrix `A`: `A=beta*M + (1-beta)*(1/N)*e*e^T`, where `e` is the vector of all 1s.
The iteration becomes `r^((t+1)) = A * r^((t))`

import numpy as np

class PageRank_Matrix:
    "Power iteration with random teleports that addresses Spider trap problem or Dead end problem "
    beta = 0.85
    epsilon = 0.0001

    def __init__(self, beta=0.85, epsilon=0.0001):
        self.beta = beta
        self.epsilon = epsilon

    def distance(self, v1, v2):
        v = v1 - v2
        v = v * v
        return np.sum(v)

    def compute(self, G):
        "G is N*N matrix where if j links to i then G[i][j]==1, else G[i][j]==0"
        N = len(G)
        d = np.zeros(N)
        for i in range(N):
            for j in range(N):
                if (G[j, i] == 1):
                    d[i] += 1
            if d[i]==0:   # i is dead end, teleport always
                d[i] = N

        r0 = np.zeros(N, dtype=np.float32) + 1.0 / N
        # construct stochastic M
        M = np.zeros((N, N), dtype=np.float32)
        for i in range(N):
            if (d[i]==N):  # i is dead end
                for j in range(N):
                    M[j, i] = 1.0 / d[i]
            else:
                for j in range(N):
                    if G[j, i] == 1:
                        M[j, i] = 1.0 / d[i]

        T = (1.0 - self.beta) * (1.0 / N) * (np.zeros((N, N), dtype=np.float32) + 1.0)
        A = self.beta * M +  T
        while True:
            r1 = np.dot(A, r0)
            dist = self.distance(r1, r0)
            if dist < self.epsilon:
                break
            else:
                r0 = r1

        return r1

In this version, `beta` is introduced to reflect the probability of teleporting. In addition, matrix `M` is preprocessed to make it stochastic for dead-end nodes: a dead end teleports to every node with equal probability. The result is [0.30623639, 0.43920633, 0.25455755].
