Bloom Filter is a probabilistic set data structure used to test whether an element is in a set or not.  It has the advantage of fast operation over traditional set data structures but at cost of sacraficing correctness: false positive matches are possible, meaning that a query returns either “possibly in set” or “definitely not in set”. A bloom filter leverages predetermined k hash functions, each of which maps to a location of the underlying bit array for every item added. It’s like sampling of each element and loses precision during the process. Natrually, only add operation is supported but not remove. Not surprisingly, the more elements added into the set, the larger probability of false positives.

To pick up k independent while uniformly distributed hash functions is non trial. But a simplified version of Java implementation described in the blog adopts double hashing strategy, code available @ my github Bloom Filter.

Elements of Implementation

  • The actual data structure representing a bloom filter is a bit array, or in Java, the BitSet of length bitSize.
  • To obtain k distint hash values for each item using double hashing
    `h_i(key) = h_1(key) + i*h_2(key)`

Probabilistic relations

  • Optimal number of hash functions (in Wikipedia) with given input entries and bits size, which would make the false positive probability lowest
    `k=(bitSize)/(estimatedEntries)*ln2`
  • Optimal bits size with given input entries and expected false positive probability
    `bitSize=-(estimatedEntries*ln(fpp))/(ln2)^2`
  • False positive probability based on given input entries and bits size, corresponding optimal number of hash function k
    `E|fpp| = (1-1/(bitSize))^(k*estimatedEntries)`

The github project also includes a test that computes real false positive probability of this implementation.

Core Code Illustrated

    public BloomFilter(int expectedEntries, int byteSize) {
        if (!(expectedEntries > 0)) {
            throw new IllegalArgumentException("expectedEntries should be > 0");
        }
        this.expectedEntries = expectedEntries;
        this.numHashFunctions = optimalNumOfHashFunctions(expectedEntries, byteSize << 3);
        this.bitSet = new BitSet(byteSize * 8);
    }

    public void addHash(int hash32) {
        int hash1 = hash32;
        int hash2 = hash32 >>> 16;

        for (int i = 1; i <= numHashFunctions; i++) {
            int combinedHash = hash1 + (i * hash2);
            // hashcode should be positive, flip all the bits if it's negative
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;
            }
            int pos = combinedHash % bitSet.size();
            bitSet.set(pos);
        }
    }

    public boolean testHash(int hash32) {
        int hash1 = hash32;
        int hash2 = hash32 >>> 16;

        for (int i = 1; i <= numHashFunctions; i++) {
            int combinedHash = hash1 + (i * hash2);
            // hashcode should be positive, flip all the bits if it's negative
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;
            }
            int pos = combinedHash % bitSet.size();
            if (!bitSet.get(pos)) {
                return false;
            }
        }
        return true;
    }

Screen snapshot in ME-Mydoccode_bloomfilter