Bloom Filter – Space/Time Trade-offs in Hash Coding with Allowable Errors

What is bloom filter?

  • the main idea is to use bit array in fixed size to check if a key belong to a set of keys. The array size is much smaller then the size of the key set. The algorithms will identify for sure if a key does not belong to the set but might have false positive response for belonging to the set. the largest the array the lower the chance for false positive.

what is its usage?

  • powerful for checking if a key belong to a set of keys where the key space is very large but the set is relatively small. For example, assume we want to write an application that does word hyphenation. 90% of english words have the same simple rule for hyphenation  while the rest have special complicated rules. We can use a filter that check for the 10% words in an efficient way.
  • used by web caches for ignoring one time hits. According to wikipedia 75% of the cached URLs are single hit URLs. In order to ignore one time, web caches check if its bloom filter might contain a new URL if not then it means this is first hit and so it does not add it to the cache and update the bloom filter. next time the same URL will be hit, then it will be found in the bloom filter and so will be added to the cache filter.
  • can be used to reduce work load from the server by delegating work to the browsers: lets say you have a big list that must be handled by a server, you can generate a bloom filter for it and have the client store it. in case the key might be in the list, then a more accurate examination in the server side can be done. e.g., it can be used to manage URL black lists in the browsers.
  • used by Cassandra to check if a key exist in a table file

Bloom algorithm details

  • inserting a key from the set to the bit array:
    • define a bit array with size M.
    • choose K different hash functions that take as input a key from the set and map it to a value in the range {1,..,M}. we will consider the hash result as an index location in the array.
    • apply each hash function on the key. this will produce K values in the range {1,..M}. Set the value in the array-index j in the array if j is one of the K’s hash functions results i.e., there will be maximum of K bits set ‘1’ for a key.
    • repeat the process for each key. For each key – more bits will be set in the array.
    • we will end up with a single bit array A representing the set.
  • Testing if a key belong to the set:
    • process the key by each of the K hash functions, for each hash function we will have a subset L of the set {1,..M}. verify that all bits are set in A in locations of L e.g., if L contains ‘3’, we need to check that the 3rd bit in A is set.
    • If for one of the K hash functions – the result is N but the N-th bit is not set in A then we can be sure that the key is not belong to the set.
    • If all K checkes pass then the key might belong to the set but there is an chance of false positive.

References