
TF-IDF (Java)

Algorithm Origin: Coursera: Mining of Massive Datasets
Code at GitHub TF-IDF

TF-IDF Concept

TF-IDF, short for term frequency–inverse document frequency, is a statistic intended to reflect how important a word is to a document in a collection. Based on TF-IDF, it is convenient to compute document similarity or perform automatic document summarization.

The intuition is that

  1. The more often a word appears in a doc, the more important it is relative to other words. This is the term frequency part of TF-IDF.
  2. But not all words are created equal. We weigh each word by its inverse document frequency, which gives common words low weight.

TF-IDF Formula

There are several variants of the TF and IDF formulas in Wikipedia TF-IDF. In this article and the Java code, we adopt
`tf(t,d) = f_(t,d) / |d|`, where `f_(t,d)` is the number of occurrences of term `t` in doc `d` and `|d|` is the total number of terms in doc `d`.
`idf(t,D) = log(1 + N / n_t)`, where `N` is the number of all docs, `n_t` is the number of docs containing term `t`, and `log` is base 10 to match the code below.
`tfidf(t,d,D) = tf(t,d) * idf(t,D)`
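
As a quick worked computation (the numbers match the test case at the end of this section): with `N = 2` docs, a term occurring 3 times in a 7-term doc and appearing in only that one doc gets `tf = 3/7` and `idf = log(1 + 2/1) = log 3 ~~ 0.477`, so `tfidf = 3/7 * 0.477 ~~ 0.204`.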

Justification of IDF

The probability that a given document `d` contains a term `t` is
`P(t|d) = |{d in D : t in d}| / N`
So define idf as
`idf = -log P(t|d) = log(N / |{d in D : t in d}|)`
Note that a word appearing in every doc gets `idf = log 1 = 0` under this definition; the `1 +` inside the log of the variant adopted above is a smoothing adjustment that keeps every weight positive.

Code

import java.util.HashMap;

public class TF_IDF {

    int numOfWords;
    double[] idfVector;
    double[][] tfIdfMatrix;
    double[][] tfMatrix;
    String[] wordVector;
    int[] docLength;

    public TF_IDF(String[] docs) {
        // STEP 1, scan all words and count the number of different words
        // mapWordToIdx maps word to its vector index
        HashMap<String, Integer> mapWordToIdx = new HashMap<>();
        int nextIdx = 0;
        for (String doc : docs) {
            for (String word : doc.split(" ")) {
                if (!mapWordToIdx.containsKey(word)) {
                    mapWordToIdx.put(word, nextIdx);
                    nextIdx++;
                }
            }
        }

        numOfWords = mapWordToIdx.size();

        // STEP 2, create word vector where wordVector[i] is the actual word
        wordVector = new String[numOfWords];
        for (String word : mapWordToIdx.keySet()) {
            int wordIdx = mapWordToIdx.get(word);
            wordVector[wordIdx] = word;
        }

        // STEP 3, create doc word vector where docCountVector[i] is number of
        // docs containing word of index i
        // and doc length vector
        int[] docCountVector = new int[numOfWords];
        docLength = new int[docs.length];
        // lastDocWordVector is an auxiliary vector keeping track of the last
        // doc index containing the word
        int[] lastDocWordVector = new int[numOfWords];
        for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
            lastDocWordVector[wordIdx] = -1;
        }
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            String doc = docs[docIdx];
            String[] words = doc.split(" ");
            docLength[docIdx] = words.length;
            for (String word : words) {
                int wordIdx = mapWordToIdx.get(word);
                if (lastDocWordVector[wordIdx] < docIdx) {
                    lastDocWordVector[wordIdx] = docIdx;
                    docCountVector[wordIdx]++;
                }
            }
        }

        // STEP 4, compute IDF vector based on docCountVector
        idfVector = new double[numOfWords];
        for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
            idfVector[wordIdx] = Math.log10(1 + (double) docs.length / (docCountVector[wordIdx]));
        }

        // STEP 5, compute term frequency matrix, tfMatrix[docIdx][wordIdx]
        tfMatrix = new double[docs.length][];
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            tfMatrix[docIdx] = new double[numOfWords];
        }
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            String doc = docs[docIdx];
            for (String word : doc.split(" ")) {
                int wordIdx = mapWordToIdx.get(word);
                tfMatrix[docIdx][wordIdx] = tfMatrix[docIdx][wordIdx] + 1;
            }
        }
        // normalize tfMatrix by dividing by the corresponding doc length
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
                tfMatrix[docIdx][wordIdx] = tfMatrix[docIdx][wordIdx] / docLength[docIdx];
            }
        }

        // STEP 6, compute final TF-IDF matrix
        // tfIdfMatrix[docIdx][wordIdx] = tfMatrix[docIdx][wordIdx] * idfVector[wordIdx]
        tfIdfMatrix = new double[docs.length][];
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            tfIdfMatrix[docIdx] = new double[numOfWords];
        }

        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
                tfIdfMatrix[docIdx][wordIdx] = tfMatrix[docIdx][wordIdx] * idfVector[wordIdx];
            }
        }

    }

    public double[][] getTF_IDFMatrix() {
        return tfIdfMatrix;
    }

    public String[] getWordVector() {
        return wordVector;
    }

}

The test case is taken from Wikipedia TF-IDF, where
d1 = "this is a a sample" and d2 = "this is another another example example example".
The TF-IDF values for d2, sorted in descending order, are
[example, another, this, is]
[0.20448053773699817, 0.13632035849133212, 0.043004285094854454, 0.043004285094854454]
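
A minimal driver for this test might look like the sketch below (a main method meant to sit inside TF_IDF; it prints the raw values and leaves the sorting out):

public static void main(String[] args) {
    String[] docs = {
        "this is a a sample",
        "this is another another example example example"
    };
    TF_IDF tfIdf = new TF_IDF(docs);
    double[][] matrix = tfIdf.getTF_IDFMatrix();
    String[] words = tfIdf.getWordVector();
    // print the TF-IDF value of each word for d2 (docIdx 1)
    for (int w = 0; w < words.length; w++) {
        System.out.println(words[w] + " = " + matrix[1][w]);
    }
}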

Minhashing (Java)

Algorithm Origin: Coursera: Mining of Massive Datasets week 2
Code is at GitHub Minhashing

Jaccard Similarity of Two Sets

Let `A` and `B` be two sets; the Jaccard similarity is
`J(A,B) = |A nn B| / |A uu B|`

For example, consider the 4 sets illustrated below:

Element  S1  S2  S3  S4
a         1   0   0   1
b         0   0   1   0
c         0   1   0   1
d         1   0   1   1
e         0   0   1   0

`J(S_1, S_3) = |{d}|/|{a, b, d, e}| = 1/4`
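
For reference, a direct Jaccard computation in Java might look like this sketch (a standalone helper, not part of the minhashing code below):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Jaccard {

    public static double similarity(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b); // A nn B
        Set<String> union = new HashSet<>(a);
        union.addAll(b); // A uu B
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> s1 = new HashSet<>(Arrays.asList("a", "d"));
        Set<String> s3 = new HashSet<>(Arrays.asList("b", "d", "e"));
        System.out.println(similarity(s1, s3)); // 0.25
    }
}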

Minhashing and Jaccard Similarity

Because each set may contain millions of possible items, computing the exact Jaccard similarity of every pair would be a huge effort. If each set can instead be summarized by a much shorter vector, estimating Jaccard similarity becomes much cheaper.
We select a random permutation of items and define the minhash for that permutation to be the first item, in permutation order, that exists in the set. For the example above, if a randomly generated permutation h is b, e, a, d, c, then `h(S_1) = a`.

Why Minhashing Estimates Jaccard Similarity

The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
How to interpret this? Let's examine all permutations for `S_2` and `S_4` in the example above. The items involved in `S_2` or `S_4` are a, c and d.

Element  S2  S4
a         0   1
c         1   1
d         0   1

All permutations are

Permutation h   h(S_2) = h(S_4)?
adc             false
acd             false
cad             true
cda             true
dac             false
dca             false

`P(h(S_2) = h(S_4))` is 2/6 = 1/3, which is also the Jaccard similarity `J(S_2, S_4) = |{c}| / |{a, c, d}| = 1/3`. In general, the first item in permutation order that lies in either set determines both minhashes, and they agree exactly when that item lies in both sets, which happens with probability `|A nn B| / |A uu B|`.

Computing Minhash Signatures from Multiple Hash Functions

The actual minhash signature is computed using two approximations:
1. Sample only a few permutations rather than all of them.
2. Use a hash function to simulate each permutation. The hash function may map some pairs of integers to the same bucket and leave other buckets unfilled. However, the difference is unimportant as long as there are not too many collisions. (Illustrated in the implementation, although I have not worked out the mathematical correctness of this.)

import java.util.function.Function;

public class Minhashing {

    // use Integer[] so that infinity (no value yet) can be represented by null
    Integer[][] sig;

    public Minhashing(int[][] sets, Function<Integer, Integer>[] hashs) {
        int setNumber = sets.length;
        sig = new Integer[setNumber][];
        for (int i = 0; i < setNumber; i++) {
            sig[i] = new Integer[hashs.length];
        }

        // for each row index value
        for (int i = 0; i < sets[0].length; i++) {
            // calc hash vector: where each simulated permutation maps this row
            int[] hashVector = new int[hashs.length];
            for (int hashIdx = 0; hashIdx < hashs.length; hashIdx++) {
                hashVector[hashIdx] = hashs[hashIdx].apply(i);
            }

            // for each set containing this row, lower its signature entries
            for (int setIdx = 0; setIdx < setNumber; setIdx++) {
                if (sets[setIdx][i] != 0) {
                    for (int hashIdx = 0; hashIdx < hashs.length; hashIdx++) {
                        if (sig[setIdx][hashIdx] == null || sig[setIdx][hashIdx] > hashVector[hashIdx]) {
                            sig[setIdx][hashIdx] = hashVector[hashIdx];
                        }
                    }
                }
            }
        }
    }
}

The running result for the example, with `h_1(x) = (x+1) mod 5` and `h_2(x) = (3x+1) mod 5`, is

     S1  S2  S3  S4
h1    1   3   0   1
h2    0   2   0   0
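
The example can be driven with a sketch like the one below (assuming the constructor above lives in the Minhashing class and its sig field is accessible, e.g., from the same package; the unchecked array creation is only to satisfy Java's generics):

import java.util.Arrays;
import java.util.function.Function;

public class MinhashingDemo {

    public static void main(String[] args) {
        // one row vector per set S1..S4; columns are the rows a..e from the table above
        int[][] sets = {
            {1, 0, 0, 1, 0}, // S1 = {a, d}
            {0, 0, 1, 0, 0}, // S2 = {c}
            {0, 1, 0, 1, 1}, // S3 = {b, d, e}
            {1, 0, 1, 1, 0}  // S4 = {a, c, d}
        };
        Function<Integer, Integer> h1 = x -> (x + 1) % 5;
        Function<Integer, Integer> h2 = x -> (3 * x + 1) % 5;
        @SuppressWarnings("unchecked")
        Function<Integer, Integer>[] hashs = new Function[] { h1, h2 };
        Minhashing mh = new Minhashing(sets, hashs);
        System.out.println(Arrays.deepToString(mh.sig)); // [[1, 0], [3, 2], [0, 0], [1, 0]]
    }
}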

KMP (Java)

The KMP (Knuth-Morris-Pratt) string matching algorithm is implemented according to Wikipedia KMP. Code at GitHub.

KMP Table

  1. T[i] is the number of chars before the i-th char (exclusively) that match the starting sequence W[0], W[1], …
  2. If T[i] > 0, it also means W[i] corresponds to W[T[i]]: after a mismatch at W[i], comparison can resume at W[T[i]].
  3. T[0] is always -1, marking the starting element.
  4. T[1] is always 0, because the one-char prefix W[0] has no shorter non-empty prefix to match.

For example, for the following W, at pos 6 the char sequence AB ending with W[5] matches W[0], W[1], so T[6] = 2.

i     0  1  2  3  4  5  6
W[i]  A  B  C  D  A  B  D
T[i] -1  0  0  0  0  1  2

Another example is

i     0  1  2  3  4  5  6  7
W[i]  A  B  C  A  B  A  B  C
T[i] -1  0  0  0  1  2  1  2

Construct KMP Table

Code implemented according to Wikipedia pseudocode

  • pos: index into T for which T[pos] is computed. We actually only need to compare W[pos-1] with W[cnd].
  • cnd: candidate length; W[cnd-1], W[cnd-2], …, W[0] have already been matched against the chars just before W[pos-1].
  • case 1: W[pos-1] == W[cnd], the matched prefix continues. Record T[pos] and increment pos to compute the next entry.
  • case 2: W[pos-1] != W[cnd] but cnd > 0. Although W[pos-1] mismatches W[cnd], the matched prefix W[0..cnd-1] itself ends with the shorter prefix W[0..T[cnd]-1], so we fall back cnd to T[cnd] and try again.
  • case 3: cnd == 0 and W[pos-1] does not even match W[0]; T[pos] is thus 0.
// strNeedle (the pattern) and table are fields of the KMP class
table = new int[strNeedle.length()];
table[0] = -1;
table[1] = 0;
int pos = 2;
int cnd = 0;
while (pos < strNeedle.length()) {
    if (strNeedle.charAt(pos - 1) == strNeedle.charAt(cnd)) {
        // case 1: the substring continues
        cnd++;
        table[pos] = cnd;
        pos++;
    } else if (cnd > 0) {
        // case 2: it doesn't, but we can fall back
        cnd = table[cnd];
    } else {
        // case 3: we have run out of candidates.  Note cnd = 0
        table[pos] = 0;
        pos++;
    }
}

Match Algorithm

Code implemented according to Wikipedia pseudocode

  • m: starting index into strHayStack where a candidate match begins
  • i: index into strNeedle; compare strNeedle[i] with strHayStack[m+i]
  • case 1: strNeedle[i] == strHayStack[m+i], the match continues (and succeeds once i reaches the last char of strNeedle)
  • case 2: strNeedle[i] != strHayStack[m+i] and table[i] > -1; reposition the starting index m and fall back i to table[i]
  • case 3: table[i] == -1, i.e., i == 0; mismatch on the very first char of strNeedle, so advance m by 1
public int match(String strHayStack) {
    int m = 0;
    int i = 0;
    while (m + i < strHayStack.length()) {
        if (strNeedle.charAt(i) == strHayStack.charAt(m + i)) {
            // case 1: the substring continues
            if (i == strNeedle.length() - 1) {
                return m;
            }
            i++;
        } else {
            if (table[i] > -1) {
                // case 2: fall back to shorter prefix
                m = m + i - table[i];
                i = table[i];
            } else {
                // case 3: mismatch and table[i] == -1, i.e., i == 0
                i = 0;
                m = m + 1;
            }
        }
    }
    return -1;
}
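
Assuming the two snippets above are assembled into a KMP class whose constructor stores strNeedle and builds the table (my guess at how the repository wires them together), usage would look like:

KMP kmp = new KMP("ABCDABD");
System.out.println(kmp.match("ABC ABCDAB ABCDABCDABDE")); // 15, where ABCDABD begins
System.out.println(kmp.match("no such pattern here"));    // -1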
