RC: W7 D4 — Building an intuition to fine-tune bloom filter parameters

March 28, 2024

The point of creating bloom filters is to be able to know which SSTable may contain a particular record. Now that I have implemented bloom filters, I need to associate one with each SSTable. This implies two things:

  1. Keeping the bloom filter associated with each SSTable in memory, so as to know whether it is worth opening and reading the SSTable file when searching for a particular record.
  2. Writing the bloom filter to disk together with the SSTable: this will be done in the “metadata” section of the encoded SSTable (this will be useful when reading from an existing store).

I will leave the second bullet point aside for now (I will discuss it in tomorrow’s post). For today, let’s dive into the first one!
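
To make the first point concrete, here is a minimal sketch of what the read path could look like once each SSTable keeps its bloom filter in memory (the `SSTable` wrapper and its `bloom_filter`, `may_contain` and `read` attributes/methods are hypothetical names for this sketch, not my actual implementation):

def get(key: str, sstables: list) -> str | None:
    # SSTables are assumed to be ordered from most recent to oldest
    for sstable in sstables:
        # Only pay the disk I/O if the bloom filter says the key may be present
        if sstable.bloom_filter.may_contain(key):
            value = sstable.read(key)  # opens and scans the SSTable file
            if value is not None:
                return value
    # The key is in none of the SSTables
    return None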

First of all, the bloom filters for the different SSTables will not all have the same characteristics: since records are written as compactly as possible into each SSTable, the number of records per SSTable varies a bit (the smaller the records, the more an SSTable contains). Therefore, the number of keys will also vary from one bloom filter to the next. Aside from the number of elements, the other characteristics of a bloom filter are its size (number of bits), the number of hash functions it uses and its false positive rate (FP rate). Ideally, the latter should be identical across all bloom filters so that the load is distributed uniformly across all SSTables (if the FP rate were higher in some, those SSTables would see more disk I/O). I believe that, in production systems, we might want to tune the FP rate depending on access patterns, but for now, I will stick with this simple assumption.

Given these constraints, we can compute the optimal size (i.e. number of bits, \(m\)) of the bloom filter as a function of the number of items (\(n\)) and of the target FP rate (\(p\)) (all logarithms below are natural logarithms):

\[m = {-n \, \ln(p) \over (\ln 2)^2}\]

The mathematical proof for the formulas can be found on Wikipedia, but we can understand this one intuitively: the size grows linearly with the number of items (each item needs its share of bits), and it grows as the target FP rate shrinks, hence the \(-\ln(p)\) factor, which gets bigger as \(p\) gets smaller.
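
To get a feel for the numbers, here is the cost in bits per key (\(m / n\)) for a few FP rates; this is just the formula above evaluated in Python (where math.log is the natural logarithm):

from math import log

# m / n = -log(p) / (log(2) ** 2): number of bits needed per key
for p in (0.01, 0.001, 0.0001):
    bits_per_key = -log(p) / (log(2) ** 2)
    print(f"p = {p}: {bits_per_key:.1f} bits per key")

# Output:
# p = 0.01: 9.6 bits per key
# p = 0.001: 14.4 bits per key
# p = 0.0001: 19.2 bits per key

Each additional order of magnitude on the FP rate costs roughly 4.8 extra bits per key.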

For a given size (\(m\)) and number of items (\(n\)), we can also compute the optimal number of hash functions (\(k\)) that will minimize the FP rate (\(p\)). Obtaining the formula is rather straightforward and I will describe it briefly hereafter (see Wikipedia for more details). Since each hash function selects every bit with equal probability, the probability that a given bit has not been set by any of the \(k\) hash functions when inserting one element is \((1 - {1 \over m})^k\). Since \(n\) elements are inserted, the probability that a bit has not been set becomes \((1 - {1 \over m})^{kn}\). From this, the FP rate (i.e. the probability that the \(k\) bits computed by the \(k\) hash functions are all 1) is \((1 - (1 - {1 \over m})^{kn})^k\). For large \(m\), this can be approximated as \((1 - e^{-{kn \over m}})^k\), and minimizing this expression with respect to \(k\) gives the optimal number of hash functions:

\[k = { m \over n } \ln(2)\]

Again, we can build an intuition for this formula: each additional hash function means one more bit has to match for a lookup to report a (false) positive, but it also sets more bits per inserted element and therefore fills the filter faster. Overall, the formula balances the need to maintain both a low risk of collision (smaller \(k\) when \(n\) increases) and a low FP rate (bigger \(k\) when \(m\) increases). On a side note, substituting this optimal \(k\) back into the FP rate expression is where the \((\ln 2)^2\) denominator of the first formula comes from.
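
To double-check the result, here is a quick numerical sanity check (with arbitrary values of \(m\) and \(n\)) that the approximated FP rate \((1 - e^{-{kn \over m}})^k\) is indeed minimized around \(k = { m \over n } \ln(2)\):

from math import exp, log

n = 1_000_000   # number of items (arbitrary)
m = 14_400_000  # number of bits (arbitrary)

# Approximated FP rate for a range of candidate numbers of hash functions
for k in range(5, 16):
    fp_rate = (1 - exp(-k * n / m)) ** k
    print(f"k = {k:2d}: FP rate = {fp_rate:.6f}")

print(f"theoretical optimum: k = {(m / n) * log(2):.2f}")

The minimum indeed lands at \(k = 10\), right next to the theoretical optimum of about 9.98.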

Well… That was a big chunk of math!

Now, let’s turn to what matters for my SSTables. A typical size for an SSTable is 256MB. Let’s assume that the average size of a record is 100B (it would typically be a few tens of bytes for a textual key-value pair, but could be much more than that if we are storing JSON objects or images, for example). That gives an average of roughly 2.5 million records per SSTable. Using the formula above, for an expected FP rate of 0.1% (one in a thousand), the optimal size of the associated bloom filter is about 36 million bits, which comes down to roughly 4.5MB. Even if we had, say, as many as a thousand SSTables, the total size of the bloom filters would be 4.5GB, which is manageable to keep in memory.
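
Written out with the first formula:

\[m = {-2.5 \times 10^6 \cdot \ln(0.001) \over (\ln 2)^2} \approx 3.6 \times 10^7 \ \text{bits} \approx 4.5 \ \text{MB}\]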

With all this, I thus decided to tune my bloom filters to have an FP rate of 0.1%. I added a build class method to BloomFilter that instantiates a filter with the computed optimal parameters and fills it with the given keys:

from math import log, ceil


class BloomFilter:
    def __init__(self, nb_bytes: int, nb_hash_functions: int):
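        # Implementation omitted here (the bloom filter itself was implemented previously)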
        pass

    def add(self, key: str):
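        # Implementation omitted here as well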
        pass

    @classmethod
    def build(cls, keys: list[str], fp_rate: float) -> "BloomFilter":
        # Compute optimal parameters
        n = len(keys)
        m = (-n * log(fp_rate)) / (log(2) ** 2)
        k = (m / n) * log(2)

        # Instantiate bloom filter with optimal parameters
        bloom_filter = cls(nb_bytes=ceil(m / 8), nb_hash_functions=round(k))

        # Add all keys to bloom filter
        for key in keys:
            bloom_filter.add(key)

        return bloom_filter

And now, every time I create a new SSTable, I can build the corresponding bloom filter alongside it, with a fixed FP rate!
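
As a toy example of the kind of call I have in mind (the records themselves are made up):

# Key/value pairs about to be written to a new SSTable
records = {"key_1": "value_1", "key_2": "value_2"}

# Build the associated bloom filter with the fixed 0.1% FP rate
bloom_filter = BloomFilter.build(keys=list(records.keys()), fp_rate=0.001)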