Algorithm Origin: Coursera: Mining of Massive Datasets

Code at GitHub Apriori

Apriori is algorithm for frequent item set mining. For example, given a item set where each line indicates a set of item.

1 2 4

1 3 4

1 4 5 6

1 2 3 4

2 3

4 5 6

7

The data given above has 7 item sets, the first one being items numbered 1, 2 and 4. Suppose we want to find all item sets having appearance of more than two, Apriori offers an iterative approach. First, Apriori computes all items having more than (or equal) appearance of two. And each later round result sets of length one more than current matched sets are computed.

For the example above, output of each round is listed below

Round 1

1,

2,

3,

4,

5,

6,

Round 2

1, 2,

1, 3,

1, 4,

2, 3,

2, 4,

3, 4,

4, 5,

4, 6,

5, 6,

Round 3

1, 2, 4,

1, 3, 4,

4, 5, 6,

In round 3, 3 item sets, **(1, 2, 4), (1, 3, 4) and (4, 5, 6) ** are found to be appear equal or more than two in input data, which can be comfirmed manually.

The core part of Apriori for each iteration is join, filter algorithms. I implemented them without optimization but sketched its spirit.

I am diving into join algorithm a little bit.

Join takes in a set of tuples of length n. It first permutes all combinations of any two tuples. Let’s say tuple A and tuple B in consideration, if they differ in only 1 item, then tuple C of length n+1 is constructed and tuple C should contain all elements of A and B. For instance, tuple A = (1, 3, 4) and tuple B = (2, 3, 4), because A and B has only one different element so tuple C of length 4 can be constructed, C = (1, 2, 3, 4). However, the candidate tuple C must meet another criteria that all its low order tuples must be in current matched set. That is (1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4) must in current matched set.