RC: W2 D3 — Build a generator to decode a file chunk by chunk
February 21, 2024
One of the limitations of an append-only storage engine is that it uses a lot of disk space: since you only ever append, many overridden (and thus obsolete) key-value pairs remain on disk, even though their values will never be read again.
To overcome this disk-usage problem, Bitcask performs a “merge” process. It basically consists of:
- reading a bunch of already-written (and immutable) files
- keeping only the most up-to-date value for each key, that is, the value that was written last for that key (sketched right after this list)
- writing this down in a new “merged” file
- discarding all previous files
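To make the deduplication step concrete, here is a minimal, self-contained sketch of the “keep only the latest value” logic. It is illustrative only: the real merge operates on encoded files, while this sketch uses plain lists of (key, value) pairs.

def merge(segments):
    """Keep only the most recent value for each key across all segments.

    `segments` is assumed to be ordered from oldest to newest, and each
    segment lists its (key, value) pairs in write order.
    """
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value  # a later write simply overrides the earlier one
    return latest

# "a" was overridden in the second segment, so only its latest value survives
merged = merge([[("a", b"1"), ("b", b"2")], [("a", b"3")]])
assert merged == {"a": b"3", "b": b"2"}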
While I was in the middle of implementing this, I stumbled upon some more encoding/decoding problems.
In a nutshell, the key-value items are encoded before being written to the files. So far, each item contains:
- a timestamp (stored as a 32-bit integer)
- the key size (stored as a 32-bit integer)
- the value size (stored as a 32-bit integer)
- the key (encoded in UTF-8)
- the value (encoded in UTF-8)
The trick is that each of these entries may have a different size, depending on the sizes of its key and value. More precisely, the exact size of each entry is 4 + 4 + 4 + key_size + value_size bytes.
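For illustration, here is one plausible way to encode a single item with this layout. This is a sketch: the big-endian byte order and the use of struct are my assumptions for this example, not necessarily what the actual implementation does.

import struct

def encode(timestamp, key, value):
    """Pack one item as: timestamp | key_size | value_size | key | value."""
    key_bytes = key.encode("utf-8")
    value_bytes = value.encode("utf-8")
    # ">III" = three big-endian unsigned 32-bit integers (4 bytes each)
    header = struct.pack(">III", timestamp, len(key_bytes), len(value_bytes))
    return header + key_bytes + value_bytes

item = encode(1708473600, "name", "bitcask")
assert len(item) == 4 + 4 + 4 + len("name") + len("bitcask")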
Thus, in order to read and decode all the entries stored in a file, I need to read it chunk by chunk, where each chunk spans exactly the number of bytes occupied by one item. I ended up doing this as follows:
import os

class StoredItem:
    key: str
    value: bytes
    size: int

    @classmethod
    def from_bytes(cls, data):
        """A class method that decodes a stream of bytes (data) into a StoredItem."""
        # Hiding the implementation to show only details relevant to this post
        return cls()

class File:
    path: str

    def __iter__(self):
        """Overriding the __iter__ dunder method to read all stored_items."""
        file_size = os.path.getsize(self.path)
        with open(self.path, "rb") as file:
            data = file.read()
        offset = 0
        while offset < file_size:  # stop generating when the end of file is reached
            chunk = data[offset:]  # slice the data stream (only the next chunk)
            stored_item = StoredItem.from_bytes(chunk)  # decode the chunk
            offset += stored_item.size  # move the cursor past the decoded item
            yield stored_item
(This code is not complete: I only kept the details that are relevant to this article.)
Note that this implementation has one limitation: it loads the whole file content into memory (data = file.read()). I do this because I don’t know in advance how many bytes I will need to read before decoding them (because each stored item has a different size). Once I have finished decoding a chunk, I know its size and can move the cursor forward by that number of bytes for the next iteration (offset += stored_item.size).
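Since the real from_bytes implementation is hidden above, here is a sketch of what such a decoder could look like, again assuming big-endian 32-bit integers. It is hypothetical code for illustration, not my exact implementation.

import struct

HEADER_SIZE = 4 + 4 + 4  # timestamp + key_size + value_size

def decode_one(data):
    """Decode the first item in `data` and return (key, value, size)."""
    timestamp, key_size, value_size = struct.unpack(">III", data[:HEADER_SIZE])
    key = data[HEADER_SIZE:HEADER_SIZE + key_size].decode("utf-8")
    value_start = HEADER_SIZE + key_size
    value = data[value_start:value_start + value_size].decode("utf-8")
    size = HEADER_SIZE + key_size + value_size  # total bytes consumed by this item
    return key, value, size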
The yield keyword turns __iter__ into a generator function: calling it returns a generator, which has the interesting property of producing values on the fly (instead of storing them all in memory), thus saving RAM.
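A tiny, illustrative example makes this laziness visible:

def numbers():
    print("producing 1")
    yield 1
    print("producing 2")
    yield 2

gen = numbers()    # nothing runs yet: calling the function just builds a generator
first = next(gen)  # only now does the body run, up to the first yield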
To make it even cleaner, I overrode the __iter__ dunder method (“dunder” = “double underscore”) of the File class.
Now, we can iterate over all the stored_items contained in the file as follows:
file = File()
for stored_item in file:
    # process each decoded stored_item
    ...
Overall, this is a rather elegant way of reading our encoded file chunk by chunk.
Hopefully, tomorrow, I will be able to finally implement the merge worker fully 🤞