RC: W2 D2 — Encoding records
February 20, 2024I paired today with both David and Pierre-Antoine and we had some very interesting discussions about how to write the data I have for each item on disk.
Each item is a key-value pair. So I need to store at least the key and the value on disk. In addition, since the storage engine I am implementing is append-only, any more recent value for a given key will replace the previous value. Thus, I also want to store a timestamp to keep this information (this may be overkill, because by definition, any row written further in the file is more recent than previous rows, but I will keep the timestamp anyway to know when the last update was). Last, I will keep the sizes of the key and the value, so that I know exactly the length of each item stored on disk. These last three portions (timestamp, size of key, size of value) should be stored as 32-bit integers.
Schematically, each stored item should look like this:
+-----------+----------+------------+---------------+-----------------+
| timestamp | key_size | value_size | key | value |
+-----------+----------+------------+---------------+-----------------+
| 32 bits | 32 bits | 32 bits | key_size bits | value_size bits |
+-----------+----------+------------+---------------+-----------------+
One last thing: we will assume that the key is always stored as a string. As for values, the storage engine should be able to store either strings, integers, floats, images, videos or anything else really. We want the storage engine to be agnostic of the type of data the value represents. Therefore, we should assume that the value is passed as a stream of bytes to the storage engine. Said differently, the storage engine’s API should expect a value passed in bytes.
Now that we have the full picture of each stored item, we are going to need to encode each portion accordingly: the timestamp, key size and value size as three 32-bit integers, then the key as a string encoded to UTF-8, then the value written as is (because it is already a stream of bytes).
Also, to avoid leaking abstractions, Pierre-Antoine very rightly advised me to encapsulate all of this logic in a separate class. Here is what it looks like:
import struct
NB_BYTES_INTEGER = 4
ENCODING = "utf-8"
class StoredItem:
def __init__(self, key: str, value: bytes, timestamp: int):
self.key = key
self.value = value
self.timestamp = timestamp
self.key_size = len(self.key)
self.value_size = len(self.value)
def to_bytes(self) -> bytes:
encoded_metadata = struct.pack("iii", self.timestamp, self.key_size, self.value_size)
encoded_key = bytes(self.key, encoding=ENCODING)
encoded_value = self.value
return encoded_metadata + encoded_key + encoded_value
@classmethod
def from_bytes(cls, data: bytes) -> "StoredItem":
# metadata_offset is the number of bytes expected in the metadata
metadata_offset = 3 * NB_BYTES_INTEGER
timestamp, key_size, value_size = struct.unpack("iii", data[:metadata_offset])
key = str(data[metadata_offset: metadata_offset + key_size], encoding=ENCODING)
value = data[metadata_offset + key_size: metadata_offset + key_size + value_size]
return cls(key=key, value=value, timestamp=timestamp)
This implementation provides:
- a
to_bytes
method to encode all the portions of aStoredItem
into a stream of bytes - a
from_bytes
method to decode an encoded stream of bytes into aStoredItem
In particular, something that was not straightforward to me: we can concatenate portions that use different types of
encoding (struct.pack("iii", data)
to encode 3 integers, the key string encoded to “utf-8” and the value which is
already a stream of bytes) as long as we are able, afterwards, to decode all these portions accordingly (i.e. we need
to know how many bytes correspond to which portion, which is straightforward here, since the integers are 32-bit/4-byte
long and the key and value sizes are recorded).
That’s it for today! Looking forward to discovering what tomorrow will bring!