RC: W7 D5 — Manipulating bits to serialize my bloom filters

March 29, 2024

Now that I have a bloom filter and know how to fine-tune it, I need to write it to disk together with my SSTable. I thus decided to change the encoding of my SSTable to include it as such:

+-----------------------+---------------------------+--------------+----------------------------+
|      Data Blocks      |     Meta (Data Blocks)    | Meta (Bloom) |            Extra           |
+-----------------------+---------------------------+--------------+----------------------------+
| DB1 | DB2 | ... | DBn | meta_DB1 | ... | meta_DBn | bloom filter | meta_offset | bloom_offset |
+-----------------------+---------------------------+--------------+----------------------------+
(DB = Data Block)

So the only thing left to do is to serialize bloom filters. Each bloom filter stores the data in an array of bits. When the latter is stored in memory as a Python object, it corresponds to an integer. Given the calculations made in yesterday’s post, it can be up to 36 million bits. This made me wonder about the possibility of overflow errors, so I checked and found that, in Python, there is actually no overflow error on integers because they are implemented as “long” integer objects of arbitrary size. Therefore, even if the integer is one with up to 36 million bits, the array of bits should remain correct when manipulated in memory, and I don’t need to think about the number of bytes it will take.

However, when encoding it to serialize it to disk, this is no longer true: I need to know how many bytes will be used to encode it so that I can decode it later. I thus decided to break my array of \(n\) bits into \(n \over 8\) 1-byte encoded integers.

To understand how to do this, let’s consider the following 16-bit bloom filter:

| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |

To find out the two side-by-side 8-bit integers that correspond to this 16-bit array integer, I can use a bitmask. Bitmasking is a technique used for bitwise operations to turn certain bits on or off. In my case, I need to break my array into 8-bits integers, so I can use a mask containing 8 bits set to 1 and perform a bitwise AND operation to compute the correct number. Here is how it can be done:

                                <------ PORTION TO ENCODE ------>
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | <-- original
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | <-- 0xFF mask

------------------------- AND OPERATION -------------------------

                                | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | <-- result (8 bits)

So the right-most integer is 1 (encoded as 00000001).

<------ PORTION TO ENCODE ------>
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | <-- original
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |                                 <-- 0xFF mask

------------------------- AND OPERATION -------------------------

| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1                                   <-- result (8 bits)

The second right-most integer is 75 (encoded as 01001011).

I can then write these two integers (75 and 1) to disk side-by-side.

When decoding, I can later recreate the full array by decoding each 1-byte integer one-by-one and recreate the full array of bits from there. Hereunder is the full code for encoding and decoding my bloom filters:

import struct


class BloomFilter:
    def to_bytes(self) -> bytes:
        # 0xFF is 255 (mask to get the lowest byte)
        bytes_to_pack = tuple((self.bits >> (8 * i)) & 0xFF
                              for i in range(self.nb_bytes))
        encoded_bits = struct.pack("B" * self.nb_bytes, *bytes_to_pack)
        encoded_nb_hash_functions = struct.pack("B", len(self.hash_functions))
        return encoded_bits + encoded_nb_hash_functions

    @classmethod
    def from_bytes(cls, data: bytes) -> "BloomFilter":
        nb_bytes = len(data) - 1
        nb_hash_functions = struct.unpack("B", data[nb_bytes:])[0]
        unpacked_bytes = struct.unpack("B" * nb_bytes, data[:nb_bytes])
        bits = sum(byte << (8 * i) for i, byte in enumerate(unpacked_bytes))

        return cls(nb_bytes=nb_bytes,
                   nb_hash_functions=nb_hash_functions,
                   bits=bits)

My bloom filters can now be serialized together with my SSTables!