Algorithm Origin: Coursera: Mining of Massive Datasets

Code at GitHub TF-IDF

# TF-IDF Concept

TF-IDF, short for term frequency–inverse document frequency, is a statistic intended to reflect how important a word is to a document in a collection. TF-IDF values make it convenient to compute document similarity or to build automated document summarization.

The intuition is that:

- The more often a word appears in a doc, the more important it is relative to other words. This is the **term frequency** part of TF-IDF.
- But not all words are created equal. We need to weigh each word by its **inverse document frequency**, which gives common words low weight.

# TF-IDF Formula

The Wikipedia TF-IDF article lists several variants of the TF and IDF formulas. In this article and the Java code, we adopt TF as

`tf(t,d) = f_(t,d) / |d|`, where `f_(t,d)` is the number of occurrences of term `t` in doc `d`, and `|d|` is the total number of terms in doc `d`, and IDF as

`idf(t, D) = log(1 + N / n_t)`, where `N` is the number of docs and `n_t` is the number of docs containing term `t`. (The Java code below uses the base-10 logarithm.)

`tfidf(t,d,D) = tf(t,d) * idf(t,D)`
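The formulas above can be sketched directly as small helper methods. This is a minimal illustration (the class and method names here are hypothetical, not part of the code below), using the base-10 logarithm to match the Java class in the Code section:

```java
public class TfIdfFormula {
    // tf(t,d) = f_(t,d) / |d|
    static double tf(String term, String doc) {
        String[] words = doc.split(" ");
        int count = 0;
        for (String w : words) {
            if (w.equals(term)) count++;
        }
        return (double) count / words.length;
    }

    // idf(t,D) = log10(1 + N / n_t)
    static double idf(String term, String[] docs) {
        int nT = 0; // number of docs containing the term
        for (String doc : docs) {
            for (String w : doc.split(" ")) {
                if (w.equals(term)) { nT++; break; }
            }
        }
        return Math.log10(1.0 + (double) docs.length / nT);
    }

    static double tfIdf(String term, String doc, String[] docs) {
        return tf(term, doc) * idf(term, docs);
    }

    public static void main(String[] args) {
        String[] docs = { "a b a", "b c" };
        // "a" occurs 2 times among the 3 terms of docs[0], in 1 of 2 docs:
        // tf = 2/3, idf = log10(1 + 2/1) = log10(3)
        System.out.println(tfIdf("a", docs[0], docs));
    }
}
```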

# Justification of IDF

The probability that a given document `d` contains a term `t` can be estimated as

`P(t|d) = |{d in D : t in d}| / N`

So IDF is defined as the negative log of that probability:

`idf = -log P(t|d) = log(N / |{d in D : t in d}|)`

The `1 +` inside the log in the formula above is a smoothed variant of this, which keeps the weight positive even for a term appearing in every document.
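For concreteness, take the two test documents used later in this article (`N = 2`). The smoothed form `log(1 + N / n_t)` keeps every weight positive:

```
idf("this")    = log10(1 + 2/2) = log10(2) ≈ 0.3010   ("this" appears in both docs)
idf("example") = log10(1 + 2/1) = log10(3) ≈ 0.4771   ("example" appears in one doc)
```

Without the smoothing, `idf("this") = log10(2/2) = 0`, so a term appearing in every document would be weighted zero everywhere.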

# Code

```java
import java.util.HashMap;

public class TF_IDF {
    int numOfWords;
    double[] idfVector;
    double[][] tfIdfMatrix;
    double[][] tfMatrix;
    String[] wordVector;
    int[] docLength;

    public TF_IDF(String[] docs) {
        // STEP 1, scan all words and count the number of different words
        // mapWordToIdx maps word to its vector index
        HashMap<String, Integer> mapWordToIdx = new HashMap<>();
        int nextIdx = 0;
        for (String doc : docs) {
            for (String word : doc.split(" ")) {
                if (!mapWordToIdx.containsKey(word)) {
                    mapWordToIdx.put(word, nextIdx);
                    nextIdx++;
                }
            }
        }
        numOfWords = mapWordToIdx.size();

        // STEP 2, create word vector where wordVector[i] is the actual word
        wordVector = new String[numOfWords];
        for (String word : mapWordToIdx.keySet()) {
            int wordIdx = mapWordToIdx.get(word);
            wordVector[wordIdx] = word;
        }

        // STEP 3, create doc count vector where docCountVector[i] is the number
        // of docs containing the word of index i, and the doc length vector
        int[] docCountVector = new int[numOfWords];
        docLength = new int[docs.length];
        // lastDocWordVector is an auxiliary vector keeping track of the last
        // doc index containing the word
        int[] lastDocWordVector = new int[numOfWords];
        for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
            lastDocWordVector[wordIdx] = -1;
        }
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            String[] words = docs[docIdx].split(" ");
            docLength[docIdx] = words.length;
            for (String word : words) {
                int wordIdx = mapWordToIdx.get(word);
                if (lastDocWordVector[wordIdx] < docIdx) {
                    lastDocWordVector[wordIdx] = docIdx;
                    docCountVector[wordIdx]++;
                }
            }
        }

        // STEP 4, compute IDF vector based on docCountVector
        idfVector = new double[numOfWords];
        for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
            idfVector[wordIdx] =
                Math.log10(1 + (double) docs.length / docCountVector[wordIdx]);
        }

        // STEP 5, compute term frequency matrix, tfMatrix[docIdx][wordIdx]
        tfMatrix = new double[docs.length][numOfWords];
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            for (String word : docs[docIdx].split(" ")) {
                int wordIdx = mapWordToIdx.get(word);
                tfMatrix[docIdx][wordIdx]++;
            }
        }
        // normalize tfMatrix by dividing by the corresponding doc length
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
                tfMatrix[docIdx][wordIdx] /= docLength[docIdx];
            }
        }

        // STEP 6, compute final TF-IDF matrix
        // tfIdfMatrix[docIdx][wordIdx] = tfMatrix[docIdx][wordIdx] * idfVector[wordIdx]
        tfIdfMatrix = new double[docs.length][numOfWords];
        for (int docIdx = 0; docIdx < docs.length; docIdx++) {
            for (int wordIdx = 0; wordIdx < numOfWords; wordIdx++) {
                tfIdfMatrix[docIdx][wordIdx] =
                    tfMatrix[docIdx][wordIdx] * idfVector[wordIdx];
            }
        }
    }

    public double[][] getTF_IDFMatrix() {
        return tfIdfMatrix;
    }

    public String[] getWordVector() {
        return wordVector;
    }
}
```

The test case is taken from the Wikipedia TF-IDF article, where

d1 = "this is a a sample" and d2 = "this is another another example example example"

The sorted TF-IDF values for d2 are

[example, another, this, is]

[0.20448053773699817, 0.13632035849133212, 0.043004285094854454, 0.043004285094854454]
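These numbers can be checked with a small self-contained driver that recomputes the per-term scores of d2 and sorts them in descending order. This sketch (the class and method names are illustrative, not part of the original code) does not require the `TF_IDF` class above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VerifyD2 {
    // tf-idf scores for every distinct term of docs[docIdx], sorted descending
    static List<Map.Entry<String, Double>> scores(String[] docs, int docIdx) {
        String[] words = docs[docIdx].split(" ");
        Map<String, Double> out = new LinkedHashMap<>();
        for (String t : words) {
            // occurrences of t in this doc
            long count = Arrays.stream(words).filter(t::equals).count();
            // number of docs containing t
            long nT = Arrays.stream(docs)
                    .filter(d -> Arrays.asList(d.split(" ")).contains(t))
                    .count();
            out.put(t, (double) count / words.length
                       * Math.log10(1.0 + (double) docs.length / nT));
        }
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(out.entrySet());
        sorted.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return sorted;
    }

    public static void main(String[] args) {
        String[] docs = {
            "this is a a sample",
            "this is another another example example example"
        };
        for (Map.Entry<String, Double> e : scores(docs, 1)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

For d2, "example" gets the top weight (3/7 of the doc, in only one doc), while "this" and "is" are discounted because they appear in both documents.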