A Simple Bloom Filter

It is a probabilistic data structure, that is commonly used to ask question “whether an element is in it or not?”. From my description, it does seems like a job that Set can do, sooo…

Why Should We Use It?

The major advantage of Bloom Filter is space complexity, while other data structures, such as Set and HashTable requires to store the data itself, Bloom Filter does not requires data storing.

When Should We Use It?

Although Bloom filter has major space complexity advantage, yet it can only answer in probabilistic term: whether an item is “possibly in the filter” or “not a chance in the filter”. Because of this characteristic of Bloom filter, you will only want to use it if :

You need better space complexity.
You can tolerate false positives.
You would not need to delete items from the filter. Simple bloom filter cannot support removal operation.

How Should We Implement It?

There are a couple of components to implement the bloom filter.

A Simple Bloom Filter

import sys
import math
# using murmur hashing function because it returns integer
import mmh3

# use bitarray to manage our main data structure
from bitarray import bitarray

class SimpleBloomFilter(object):
    def __init__(self, hash_fn, m, fpr):
        self._fpr = fpr
        self._hash_fn = hash_fn
        
        # size of bitarray that we should construct
        self.size = int(-(m * math.log(fpr)) / (math.log(2) ** 2))
        
        # k = how many time should we do the hash
        self._k = math.ceil((m/self.size) * math.log(2))
        
        # construct and initialize the bit array
        self._bit_array = bitarray(self.size)
        self._bit_array.setall(0)
        
    def add(self, item):

        for i in range(self._k):
            digest = self._hash_fn.hash(item, i) % self.size
            # set the bit into 1
            self._bit_array[digest] = 1
    
    def is_possibly_exists(self, item):
        for i in range(self._k):
            digest = self._hash_fn.hash(item, i) % self.size
            # check if the digest is in bit array
            # if there is one 0 found in bit_array, then
            # the item does not exists at all
            if not self._bit_array[digest]:
                return False
            
        return True
    

After we have constructed the simple bloom filter class, we are now ready to test this out!

test_elements = [
    'hello world',
    'hello 1',
    'hello3'
]
bf = SimpleBloomFilter(mmh3, len(test_elements), 0.01)
for el in test_elements:
    bf.add(el)
    
for el in test_elements:
    print('is `{}` exists in the filter? {}'.format(el, bf.is_possibly_exists(el)))

test1 = 'not exists'
print('is `{}` exists in the filter? {}'.format(test1, bf.is_possibly_exists(test1)))

is `hello world` exists in the filter? True
is `hello 1` exists in the filter? True
is `hello3` exists in the filter? True
is `not exists` exists in the filter? False

It turns out working like we are expecting! Hurray! Now we want to see how much space that we save for implementing the bloom filter

print('size of test elements: {} bytes'.format(sys.getsizeof(test_elements)))
print('size of bloom filters: {} bytes'.format(sys.getsizeof(bf)))

size of test elements: 88 bytes
size of bloom filters: 56 bytes

It doesn’t seems to be much of saving, eh? Let’s simulate it with larger elements!

bigger

import hashlib

def simulate_bloom_filter_size(element_size):
    # generate larger test elements
    test_elements = list()
    test_element_length = element_size
    bf = SimpleBloomFilter(mmh3, test_element_length, 0.01)

    for i in range(test_element_length):
        str_generator = hashlib.md5('{}'.format(i).encode())
        el = str_generator.hexdigest()
        test_elements.append(el)
    
    return (sys.getsizeof(test_elements), sys.getsizeof(bf._bit_array), bf)

el_size = 10000
print('With {} Elements'.format(el_size))
print('------------------------------------------')
data_size, bf_size, _ = simulate_bloom_filter_size(el_size)
print('size of test elements: {:,} bytes'.format(data_size))
print('size of bloom filters: {:,} bytes'.format(bf_size))

With 10000 Elements
------------------------------------------
size of test elements: 87,624 bytes
size of bloom filters: 96 bytes

more

Okay we got this, but we want to know more! what if we extends until 100000?

import numpy as np
import matplotlib.pyplot as plt

observation_len = 50
c = np.linspace(1, 100000, observation_len)
data_sizes = list()
bf_sizes = list()
for el_size in c:
    data_size, bf_size, bf = simulate_bloom_filter_size(int(el_size))
    data_sizes.append(data_size)
    bf_sizes.append(bf_size)

f, ax = plt.subplots(figsize=(11, 9))
plt.title('Data vs Bloom Filter Size in Bytes')
plt.plot(np.arange(observation_len), data_sizes, label='Data Size')
plt.plot(np.arange(observation_len), bf_sizes, label='Bloom Filter Size')
plt.legend()
plt.show()

png

print('largest data size: {:,} bytes'.format(data_sizes[-1]))
print('largest bloom filter size: {:,} bytes'.format(bf_sizes[-1]))

largest data size: 824,464 bytes
largest bloom filter size: 96 bytes

Conclusion

Bloom Filter is awsome! (if you have matched the condition to use it!)

awesome

References

https://en.wikipedia.org/wiki/Bloom_filter
https://stackoverflow.com/questions/11669178/how-to-create-an-array-of-bits-in-python
https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/
https://stackoverflow.com/questions/449560/how-do-i-determine-the-size-of-an-object-in-python

Written on May 28, 2018