RC: W2 D3 — Build a generator to decode a file chunk by chunk
February 21, 2024
One of the limitations of an append-only storage engine is that it uses a lot of disk space: since you only ever append, many overridden (and thus obsolete) key-value pairs remain on disk, even though their values will never be read again.
To overcome this disk-usage problem, Bitcask performs a “merge” process. It basically consists of:
- reading a bunch of already-written (and immutable) files
- keeping only the most up-to-date value for each key, that is, the value that was written last for that key (sketched right after this list)
- writing this down in a new “merged” file
- discarding all previous files
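To make the deduplication step concrete, here is a minimal, self-contained sketch of the “keep only the latest value” logic. It is illustrative only: the real merge operates on encoded files, while this sketch uses plain lists of (key, value) pairs.

def merge(segments):
    """Keep only the most recent value for each key across all segments.

    `segments` is assumed to be ordered from oldest to newest, and each
    segment lists its (key, value) pairs in write order.
    """
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value  # a later write simply overrides the earlier one
    return latest

# "a" was overridden in the second segment, so only its latest value survives
merged = merge([[("a", b"1"), ("b", b"2")], [("a", b"3")]])
assert merged == {"a": b"3", "b": b"2"}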
While I was in the middle of implementing this, I stumbled upon some more encoding/decoding problems.
In a nutshell, the key-value items are encoded before being written to the files. So far, each item contains:
- a timestamp (stored as a 32-bit integer)
- the key size (stored as a 32-bit integer)
- the value size (stored as a 32-bit integer)
- the key (encoded in UTF-8)
- the value (encoded in UTF-8)
The trick is that each of these entries may have a different size, depending on the sizes of its key and value. More precisely, the exact size of each entry is 4 + 4 + 4 + key_size + value_size bytes.
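For illustration, here is one plausible way to encode a single item with this layout. This is a sketch: the big-endian byte order and the use of struct are my assumptions for this example, not necessarily what the actual implementation does.

import struct

def encode(timestamp, key, value):
    """Pack one item as: timestamp | key_size | value_size | key | value."""
    key_bytes = key.encode("utf-8")
    value_bytes = value.encode("utf-8")
    # ">III" = three big-endian unsigned 32-bit integers (4 bytes each)
    header = struct.pack(">III", timestamp, len(key_bytes), len(value_bytes))
    return header + key_bytes + value_bytes

item = encode(1708473600, "name", "bitcask")
assert len(item) == 4 + 4 + 4 + len("name") + len("bitcask")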
Thus, in order to read and decode all the entries stored in a file, I need to read it chunk by chunk, where each chunk spans exactly the number of bytes occupied by one item. I ended up doing this as follows:
import os

class StoredItem:
    key: str
    value: bytes
    size: int

    @classmethod
    def from_bytes(cls, data):
        """A class method that decodes a stream of bytes (data) into a StoredItem."""
        # Hiding the implementation to show only details relevant to this post
        return cls()

class File:
    path: str

    def __iter__(self):
        """Overriding the __iter__ dunder method to read all stored_items."""
        file_size = os.path.getsize(self.path)
        with open(self.path, "rb") as file:
            data = file.read()
        offset = 0
        while offset < file_size:  # stop generating when the end of file is reached
            chunk = data[offset:]  # slice the data stream (only the next chunk)
            stored_item = StoredItem.from_bytes(chunk)  # decode the chunk
            offset += stored_item.size  # move the cursor past the decoded item
            yield stored_item
(This code is not complete: I only kept the details that are relevant to this article.)
Note that this implementation has one limitation: it loads the whole file content into memory (data = file.read()). I do this because I don’t know in advance how many bytes I will need to read before decoding them (because each stored item has a different size). Once I have finished decoding a chunk, I know its size and can move the cursor forward by that number of bytes for the next iteration (offset += stored_item.size).
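Since the real from_bytes implementation is hidden above, here is a sketch of what such a decoder could look like, again assuming big-endian 32-bit integers. It is hypothetical code for illustration, not my exact implementation.

import struct

HEADER_SIZE = 4 + 4 + 4  # timestamp + key_size + value_size

def decode_one(data):
    """Decode the first item in `data` and return (key, value, size)."""
    timestamp, key_size, value_size = struct.unpack(">III", data[:HEADER_SIZE])
    key = data[HEADER_SIZE:HEADER_SIZE + key_size].decode("utf-8")
    value_start = HEADER_SIZE + key_size
    value = data[value_start:value_start + value_size].decode("utf-8")
    size = HEADER_SIZE + key_size + value_size  # total bytes consumed by this item
    return key, value, size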
The yield keyword turns __iter__ into a generator function: calling it returns a generator, which has the interesting property of producing values on the fly (instead of storing them all in memory), thus saving RAM.
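A tiny, illustrative example makes this laziness visible:

def numbers():
    print("producing 1")
    yield 1
    print("producing 2")
    yield 2

gen = numbers()    # nothing runs yet: calling the function just builds a generator
first = next(gen)  # only now does the body run, up to the first yield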
To make it even cleaner, I overrode the __iter__ dunder method (“dunder” = “double underscore”) of the File class.
Now, we can iterate over all the stored_items contained in the file as follows:
file = File()
for stored_item in file:
    # process each decoded stored_item
    ...
Overall, this is a rather elegant way of reading our encoded file chunk by chunk.
Hopefully, tomorrow, I will be able to finally implement the merge worker fully 🤞