Bloom Filter

Probabilistic Data Structure To Check Membership

False positives are possible. it might say, element might exist, even thought it isn’t.

But when it doesn’t exist, it would never say it exists.

Ref to https://systemdesign.one/bloom-filters-explained/

Requirements

Design a data structure with the following characteristics:

constant time complexity to test membership
a small amount of memory to test membership
insert and query operations should be parallelizable
test membership can yield approximate results

Insertion

the item is hashed through k hash functions
the modulo n (length of bit array) operation is executed on the output of the hash functions to identify the k array positions (buckets)
the bits at all identified buckets are set to one

There is a probability that some bits on the array are set to one multiple times due to hash collisions.

Membership test

Verify if the hash bits at all identified buckets are set to one

False positive case

Bloom filter calculator

The hur.st bloom filter calculator can be used to choose an optimal size for the bloom filter and explore how the different parameters interact. The accuracy of the bloom filter depends on the following:

number of hash functions (k)
quality of hash functions
length of the bit array (n)
number of items stored in the bloom filter

The properties of an optimal hash function for the bloom filter are the following:

fast
independent
uniformly distributed

Summary

Ref: https://ricardoanderegg.com/posts/bloom-filters-poster/

Applications

For example, one use case of Bloom filters is the following: you have a huge list of malicious URLs. In your browser, before a user navigates to a new URL, you want to check if it’s inside the list of dangerous URLs. You can use a bloom filter to do that! It will take less space than saving the full list of URLs, and if the answer from the Bloom filter is “no” (the URL is not a malicious one), you can safely let the user visit it

Reducing disk lookups for the non-existing keys.

Log-structured merge tree (LSM) storage engine used in databases such as Cassandra uses a bloom filter to check if the key exists in the SSTable

Bloom filter disadvantages

The limitations of the bloom filter are the following:

bloom filter doesn’t support the delete operation
- Use counting bloom filter, instead of bit set
false positives rate can’t be reduced to zero
bloom filter on disk requires random access due to random indices generated by hash functions

Removing an item from the bloom filter is not supported because there is no possibility to identify the bits that should be cleared. There might be other items that map onto the same bits and clearing any of the bits would introduce the possibility of false negatives.

Extensions of bloom filter

The counting bloom filter includes an array of counters for each bit in the bloom filter array. The counting bloom filter supports the delete operation.

Counting Bloom filters are much bigger than classical Bloom filters because additional memory has to be allocated for the counters even if most of them are zeros.