
Bloom Filter

A Bloom filter is a probabilistic set data structure used to test whether an element is a member of a set. It is faster and more space-efficient than traditional set data structures, but at the cost of correctness: false positive matches are possible, so a query returns either “possibly in the set” or “definitely not in the set”. A Bloom filter uses k predetermined hash functions, each of which maps every added item to a position in an underlying bit array. In effect, each element is sampled, and precision is lost in the process. Naturally, only the add operation is supported, not remove. Not surprisingly, the more elements are added to the set, the higher the probability of false positives.

Picking k independent and uniformly distributed hash functions is non-trivial. The simplified Java implementation described in this post adopts a double-hashing strategy; the code is available @ my github Bloom Filter.

Elements of Implementation

  • The underlying data structure of a bloom filter is a bit array; in Java, a BitSet of length bitSize.
  • k distinct hash values are obtained for each item using double hashing:
    `h_i(key) = h_1(key) + i*h_2(key)`

Probabilistic relations

  • Optimal number of hash functions (see Wikipedia) for a given number of entries and bit size, i.e. the k that minimizes the false positive probability
    `k=(bitSize)/(estimatedEntries)*ln2`
  • Optimal bit size for a given number of entries and target false positive probability (fpp)
    `bitSize=-(estimatedEntries*ln(fpp))/(ln2)^2`
  • Expected false positive probability for a given number of entries and bit size, using the corresponding number of hash functions k
    `E[fpp] = (1 - (1 - 1/bitSize)^(k*estimatedEntries))^k`
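
These relations translate directly into small sizing helpers. Below is a minimal Java sketch; the class and method names are illustrative and not necessarily identical to those in the linked repository:

    // Sizing helpers derived from the formulas above (illustrative sketch)
    public final class BloomFilterSizing {

        // Optimal number of hash functions: k = bitSize / estimatedEntries * ln 2
        public static int optimalNumOfHashFunctions(long estimatedEntries, long bitSize) {
            return Math.max(1, (int) Math.round((double) bitSize / estimatedEntries * Math.log(2)));
        }

        // Optimal bit size: bitSize = -(estimatedEntries * ln(fpp)) / (ln 2)^2
        public static long optimalBitSize(long estimatedEntries, double fpp) {
            return (long) Math.ceil(-estimatedEntries * Math.log(fpp) / (Math.log(2) * Math.log(2)));
        }

        // Expected false positive probability: (1 - (1 - 1/bitSize)^(k * estimatedEntries))^k
        public static double expectedFpp(long estimatedEntries, long bitSize, int numHashFunctions) {
            double fractionOfBitsSet =
                    1 - Math.pow(1 - 1.0 / bitSize, (double) numHashFunctions * estimatedEntries);
            return Math.pow(fractionOfBitsSet, numHashFunctions);
        }
    }

For example, with 1,000,000 expected entries and a 1% target false positive probability, these formulas give roughly 9.6 million bits (about 1.2 MB) and k ≈ 7.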

The github project also includes a test that measures the actual false positive rate of this implementation.

Core Code Illustrated

    public BloomFilter(int expectedEntries, int byteSize) {
        if (expectedEntries <= 0) {
            throw new IllegalArgumentException("expectedEntries should be > 0");
        }
        this.expectedEntries = expectedEntries;
        // byteSize << 3 converts the byte budget into the number of bits
        this.numHashFunctions = optimalNumOfHashFunctions(expectedEntries, byteSize << 3);
        this.bitSet = new BitSet(byteSize * 8);
    }

    // Set k bit positions derived from the item's 32-bit hash via double hashing
    public void addHash(int hash32) {
        int hash1 = hash32;
        int hash2 = hash32 >>> 16;

        for (int i = 1; i <= numHashFunctions; i++) {
            int combinedHash = hash1 + (i * hash2);
            // hashcode should be positive, flip all the bits if it's negative
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;
            }
            int pos = combinedHash % bitSet.size();
            bitSet.set(pos);
        }
    }

    // Test the k bit positions: false means definitely absent, true means possibly present
    public boolean testHash(int hash32) {
        int hash1 = hash32;
        int hash2 = hash32 >>> 16;

        for (int i = 1; i <= numHashFunctions; i++) {
            int combinedHash = hash1 + (i * hash2);
            // hashcode should be positive, flip all the bits if it's negative
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;
            }
            int pos = combinedHash % bitSet.size();
            if (!bitSet.get(pos)) {
                return false;
            }
        }
        return true;
    }
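
A quick usage sketch of the class above; the keys here are arbitrary examples, and the filter only ever sees their hash codes:

    // Roughly 1 KB of bits for up to 1000 expected entries
    BloomFilter filter = new BloomFilter(1000, 1024);

    filter.addHash("apple".hashCode());
    filter.addHash("banana".hashCode());

    System.out.println(filter.testHash("apple".hashCode()));   // true: was added
    System.out.println(filter.testHash("durian".hashCode()));  // usually false; true would be a false positive

Probing with a large number of keys that were never added and counting how often testHash returns true yields the empirical false positive rate that the test mentioned earlier measures.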


AsyncHbase Client Code Demo

Although HBase ships with its own synchronous (blocking) client API, the asynchronous client derived from OpenTSDB (OpenTSDB/asynchbase) outperforms it.
In this post, we demonstrate its basic asynchronous put and scan APIs, which return Deferred objects, the asynchbase equivalent of Java 8’s CompletableFuture. The demo performs the following three operations in sequential order.

  1. Verify that the specified table and column family exist on the HBase cluster
  2. Put a number of records asynchronously
  3. After all puts succeed, perform a full table scan

The putData method below issues a single asynchronous put and attaches a success callback and an error callback:

    public static Deferred putData() throws Exception {
        String rowKey = UUID.randomUUID().toString();
        String data = "value " + counter.incrementAndGet();
        PutRequest putRequest = new PutRequest(tableName.getBytes(),
                rowKey.getBytes(), columnFamily.getBytes(),
                qualifier.getBytes(),
                data.getBytes());
        // addCallbacks attaches a success callback and an error callback to the Deferred
        return hBaseClient.put(putRequest).addCallbacks(
                arg -> {
                    System.out.println("put data: rowkey=" + rowKey + ", value=" + data);
                    return null;
                },
                new Callback<Object, Exception>() {
                    @Override
                    public Object call(Exception arg) throws Exception {
                        arg.printStackTrace();
                        return null;
                    }
                });
    }

A CountDownLatch coordinates the verify phase with the put phase, while Deferred.group coordinates the puts with the final scan.
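
The verify phase is not shown in the snippet below. A minimal sketch, assuming the same hBaseClient, tableName, and columnFamily fields as in putData, and using asynchbase's ensureTableFamilyExists (whose Deferred fires once the table and column family are confirmed), could look like this:

    // Sketch of the verify phase: release the latch once verification completes
    final CountDownLatch verifyLatch = new CountDownLatch(1);
    hBaseClient.ensureTableFamilyExists(tableName.getBytes(), columnFamily.getBytes())
            .addCallbacks(
                    arg -> {
                        System.out.println("table and column family verified");
                        verifyLatch.countDown();
                        return null;
                    },
                    new Callback<Object, Exception>() {
                        @Override
                        public Object call(Exception e) throws Exception {
                            e.printStackTrace();
                            verifyLatch.countDown();
                            return null;
                        }
                    });
    verifyLatch.await(); // block until verification finishes before issuing puts

The put and scan phases are then chained with Deferred.group: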

        List<Deferred<Object>> putDeferredList = new ArrayList<>();
        int total = 100;
        for (int i = 0; i < total; i++) {
            putDeferredList.add(putData());
        }
        // Deferred.group fires its callback once every put Deferred has completed
        Deferred.group(putDeferredList).addCallback(
                cb -> {
                    System.out.println("All put finished");
                    startProcessScan();
                    return null;
                }
        );
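
The startProcessScan referenced above is not shown in the original snippet. A minimal sketch using asynchbase's Scanner could look like the following; the printed fields are illustrative:

    // Sketch of a full table scan (field names reused from putData above)
    public static void startProcessScan() {
        final Scanner scanner = hBaseClient.newScanner(tableName.getBytes());
        scanner.setFamily(columnFamily.getBytes());

        scanner.nextRows().addCallback(new Callback<Object, ArrayList<ArrayList<KeyValue>>>() {
            @Override
            public Object call(ArrayList<ArrayList<KeyValue>> rows) throws Exception {
                if (rows == null) {              // null signals the end of the scan
                    scanner.close();
                    System.out.println("Scan finished");
                    return null;
                }
                for (ArrayList<KeyValue> row : rows) {
                    for (KeyValue kv : row) {
                        System.out.println("scan: rowkey=" + new String(kv.key())
                                + ", value=" + new String(kv.value()));
                    }
                }
                return scanner.nextRows().addCallback(this); // fetch the next batch
            }
        });
    }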

The Maven project is downloadable @ hbaseasyncclientdemo.

