RC: W6 D1 — Format of Data Blocks

March 18, 2024

So far, I have implemented all the in-memory parts of my LSM engine. If I want this to be durable, I need to store the data on disk at some point. In LSM storage engines, data is stored on disk in so-called “Sorted String Tables” (SSTables). One SSTable contains the content of one memtable and, because entries in memtables are sorted by key, entries in SSTable are also sorted (hence their name). Each SSTable is composed of multiple data blocks, each of which contains a certain number of entries. So today’s task was to implement the generation of these blocks and I paired with Ludwig on this.

First of all, the size of one block is that of the underlying hardware: for SSDs, it corresponds to the size of one page (i.e. the smallest writable unit). Unfortunately, there is no easy way to access the page size of the SSD of my Mac (it would require looking up the manufacturer’s documentation, but Apple does not publish detailed technical documentation about the internal components of their devices). So instead, we looked up my OS’s block size for file I/O operations (i.e. the size of blocks used by the file system for reading from and writing to a disk) – which I believe should be somehow related to the page size underneath. We can access this information with the command: stat -f %k. In my case, I got 65,536 bytes. Noting that 64KB is bigger than the usual sizes reported for SSD pages (4K to 16K). I assume that this is because the OS writes multiple SSD pages at once (I really don’t know, but that could be an explanation). And for lack of anything better, I will use that value.

In each data block, we need to store n records (i.e. key-value pairs). Also, in order to retrieve them easily, we need to store their offset position within the block. One possible encoding is the following:

+--------------------+-----------------------------+-------+
|       Records      |             Meta            | Extra |
+--------------------+-----------------------------+-------+
| R1 | R2 | ... | Rn | offset_R1 | ... | offset_Rn | nb_R  |
+--------------------+-----------------------------+-------+
(R = Record)

Note: Alternatively, the number of records and offsets could be stored first. It would be fairly similar, except that this would require incrementing all offsets by the size of the header (i.e. the size of the offsets list) before flushing the block to disk.

When decoded, we need to read the last bytes first to know how many records there are, then read the metadata to figure out the offsets of the records, and last, read the searched record by positioning the pointer at the corresponding offset.

Here is the code we came up with:

import struct

INT_H_SIZE = 2  # an "H"-formatted integer in `struct` is 2-bytes long


class Block:
    def __init__(self, data: bytes, offsets: list[int]):
        self.data = data
        self.offsets = offsets

    @property
    def number_records(self) -> int:
        return len(self.offsets)

    def to_bytes(self) -> bytes:
        offset_bytes = struct.pack("H" * len(self.offsets), *self.offsets)
        number_records = struct.pack("H", self.number_records)

        return self.data + offset_bytes + number_records

    @classmethod
    def from_bytes(cls, data: bytes) -> "Block":
        # Decode number of records
        nb_records_offset = len(data) - INT_H_SIZE
        nb_records = struct.unpack("H", data[nb_records_offset:])[0]

        # Decode offsets
        offsets_start = nb_records_offset - (nb_records * INT_H_SIZE)
        offsets_format = "H" * nb_records
        offsets = list(struct.unpack(offsets_format,
                                     data[offsets_start:nb_records_offset]))

        # Decode records
        encoded_records = data[0:offsets_start]

        return cls(data=encoded_records, offsets=offsets)

One interesting thing to note is that the integers corresponding to offsets can be formatted as “H” (“half”), i.e. only 2 bytes instead of the usual 4 bytes. Indeed, the size of a block is 65,536 bytes, which is exactly equal to (2**8)*(2**8) (i.e. the maximum unsigned number we can encode with 2 bytes). Thus, we have a slightly more compact way to store all the required information in our data block.