RC: W7 D2 — Bloom filters!
March 26, 2024So far, when we search for a key in the store, we need to look at every single SSTable until one contains the record. This is costly, because that means we are going to read many blocks from disk only to find out the record is not there and proceed to the next SSTable.
We can do this more efficiently by using “bloom filters”. A bloom filter is a space-efficient probabilistic data structure that is able to say if an item may be in collection or is assuredly not in it (that is why it is probabilistic: it doesn’t say “yes” but “maybe”). For our use case, if we have a bloom filter associated to each SSTable, before looking for the record in the SSTable, we can query the bloom filter: if it says “no”, then we know there is no need to read the SSTable, and if it says “maybe”, then we read it and look for the record. By essence, the bloom filter will give false positives (it sometimes says “maybe” while the item is not in the collection). But this is okay, because we can fine tune it to have only, say, one false positive in a thousand (0.1%) or one in a million (0.0001%) for example. The advantages are that 1) the information is stored in a very limited space (so it is not very costly memory-wise) and 2) in our case, it will prevent us from reading many SSTables uselessly. All of this gain for the small cost of having one false positive once in a while.
How do these data structures work under the hood?
A bloom filter is made of an array of bits.
Every time we add an item to the collection, we hash it with a given hash function and then take the modulo of the
resulting hash by the size of the sequence of bits (so that hashed values are distributed over this range).
The bit corresponding to the hashed item is set.
The process is repeated for n
hash functions. Thus, about n
bits are set for every item included in the
collection (or possibly fewer if two hash functions turn the same bit on).
For example, if our bloom filter contains 16 bits, and we want to add “foo” and “bar” to the collection with 3 hash functions, here is what it would do:
Original bloom filter (every bit set to 0):
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Adding "foo":
- hash function 1: hash1("foo") % 8 --> 11
- hash function 2: hash2("foo") % 8 --> 14
- hash function 3: hash3("foo") % 8 --> 11
=> Set the bits 11 and 14:
14 11
| |
v v
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Adding "bar":
- hash function 1: hash1("bar") % 8 --> 0
- hash function 2: hash2("bar") % 8 --> 9
- hash function 3: hash3("bar") % 8 --> 8
=> Set the bits 0, 9, and 8
9 8 0
| | |
v v v
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Then, to know if an item is in the collection, we need to hash it by all hash functions and check the status of all those bits in the sequence:
- If they are all set, then the item may be in the collection;
- If at least one of them is not set, then we are guaranteed that the item is not in the collection.
Checking "bar":
- hash function 1: hash1("bar") % 8 --> 0
- hash function 2: hash2("bar") % 8 --> 9
- hash function 3: hash3("bar") % 8 --> 8
Check bits:
9 8 0
| | |
v v v
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
=> All 3 bits (0, 9, and 8) are set => "bar" _may_ be in the collection
Checking "baz":
- hash function 1: hash1("baz") % 8 --> 8
- hash function 2: hash2("baz") % 8 --> 6
- hash function 3: hash3("baz") % 8 --> 1
Check bits:
8 6 1
| | |
v v v
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
=> Bits 1 and 6 are not set => "baz" is _not_ in the collection
All in all, these data structures are both super powerful and very easy to understand and implement. I will use them here for my lookups in my SSTables, but they are also used in other use cases (like web caching, spam filtering, network security, or spell checkers for example). If you are interested in knowing more about those, this blog post provides very detailed and intuitive explanations about bloom filters with very beautiful graphics.
Today’s task was to implement bloom filters. I paired with Laurent on this. Here is what we came up with to create this data structure:
import mmh3
class BloomFilter:
def __init__(self,
nb_bytes: int,
nb_hash_functions: int,
bits: Optional[int] = None):
self.nb_bytes = nb_bytes
self.bits_size = 8 * nb_bytes
self.nb_hash_functions = nb_hash_functions
self.bits = bits if bits else 0
def _hash(self, key: str) -> list[int]:
encoded_key = key.encode(encoding="utf-8")
selected_bits = []
for i in range(self.nb_hash_functions):
hashed_key = mmh3.hash(encoded_key, i)
selected_bit = hashed_key % self.bits_size
selected_bits.append(selected_bit)
return selected_bits
def _set_bit(self, bit_index: int) -> None:
bit = (1 << bit_index)
self.bits |= bit
def _is_bit_set(self, bit_index: int) -> bool:
bit = (1 << bit_index)
return (self.bits & bit) == bit
def add(self, key: str) -> None:
"""Adds a key to the bloom filter.
"""
bits_to_set = self._hash(key=key)
for bit in bits_to_set:
self._set_bit(bit_index=bit)
def may_contain(self, key: str) -> bool:
"""Returns :
- True if the key may be in the bloom filter,
- False if it is guaranteed _not_ to be in it.
"""
bits_to_check = self._hash(key=key)
for bit in bits_to_check:
if not self._is_bit_set(bit_index=bit):
return False
return True
The code is pretty straightforward: we used the mmh3
library that provides a set
of hash functions to select nb_hash_functions
with which to hash each record being either inserted or looked up.
Our bloom filter two public methods: add
to insert a record and may_contain
to look it up.
Whenever inserting a record, we hash it and set the corresponding bits to 1
.
After implementing the bloom filters, we could attach one to each SSTable and update the read path so that we check the potential presence of a record in the bloom filter before reading the SSTable file, thus avoiding unnecessary disk lookups.