RC: W6 D2 — Generating Data Blocks
March 19, 2024

We know that blocks are supposed to be the size of an SSD page (cf. yesterday's post), and they are supposed to contain a certain number of records (that may vary between blocks, depending on the size of the records). Now the question is: how do we generate blocks so that they are never bigger than the expected block size?
I looked at the source code of RocksDB, LevelDB and mini-LSM, and they all have two classes related to generating blocks:
- one Block class that handles encoding/decoding data
- one BlockBuilder class that handles generating those blocks.
The BlockBuilder class contains a buffer whose size is the one expected for the block (64KB in my case) and an add method that adds a single record to the current block, provided that there is still room for it.
Here is what my implementation looks like:
```python
class Record:
    Key = str
    Value = bytes

    def __init__(self, key: Key, value: Value):
        self.key = key
        self.value = value

    def to_bytes(self) -> bytes:
        # Encoding of a single record into bytes (not shown here)
        pass


class Block:
    def __init__(self, data: bytes, offsets: list[int]):
        self.data = data
        self.offsets = offsets


class BlockBuilder:
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.offsets = []
        self.data_buffer = bytearray(self.block_size)
        self.data_length = 0

    def add(self, key: Record.Key, value: Record.Value) -> bool:
        encoded_record = Record(key=key, value=value).to_bytes()
        size = len(encoded_record)
        # Compute the new expected offset
        current_offset = self.data_length
        new_offset = current_offset + size
        # Do not add the record to the current block if it would exceed the block size
        if new_offset > self.block_size:
            return False
        # Otherwise (there is still room), add the record to the current block
        self.offsets.append(current_offset)
        self.data_buffer[current_offset:new_offset] = encoded_record
        self.data_length += size
        return True

    def create_block(self) -> Block:
        # Truncate the data to self.data_length (all bytes after it are empty)
        return Block(data=bytes(self.data_buffer[:self.data_length]),
                     offsets=self.offsets)
```
So the idea is: before adding a new record, we can compute the next expected offset. If it goes beyond the target size of the block, then we don’t add the record to that block. Otherwise, the record gets added to the current block (the encoded record is added to the buffer, and the current offset is added to the list of offsets).
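To make that flow concrete, here is a minimal sketch of how a caller could drive the builder. The build_blocks helper and its loop are my own illustration rather than part of the actual implementation, and it assumes no single record is larger than the block size:

```python
def build_blocks(records: list[Record], block_size: int = 64 * 1024) -> list[Block]:
    # Hypothetical driver: add records until the builder refuses one,
    # then freeze the current block and start a new builder.
    blocks = []
    builder = BlockBuilder(block_size=block_size)
    for record in records:
        if not builder.add(record.key, record.value):
            # Current block is full: create it, then retry the record
            # that did not fit in a fresh builder.
            blocks.append(builder.create_block())
            builder = BlockBuilder(block_size=block_size)
            builder.add(record.key, record.value)
    blocks.append(builder.create_block())
    return blocks
```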
All in all, the records are stored in the buffer in encoded form, and the list of offsets is not encoded. What's interesting about keeping records in their raw binary format is that it reduces memory overhead. If instead they were stored in a structure wrapping each key-value pair (like a list, for example), more memory would be needed to store the associated metadata (like the length and capacity of the list). To convince ourselves of that, we can compare the size of an empty stream of bytes with that of an empty list:
```python
import sys

# Empty stream of bytes
sys.getsizeof(b'')  # 33 bytes

# Empty list
sys.getsizeof([])   # 64 bytes
```
The list indeed takes up more space than the simple stream of bytes. Raw bytes may also be faster to process, given that bytes in a pre-allocated buffer are stored contiguously in memory, whereas a list in Python stores references to the actual elements. Thus, accessing the data in a list likely takes more time than reading a stream of bytes when we know where the starting offset is.
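Since the data is contiguous, reading a record back is just a slice between two offsets. The helper below is my own illustration (the Block class above doesn't define a read method); it assumes offsets[i] marks where the i-th encoded record starts:

```python
def record_bytes_at(block: Block, index: int) -> bytes:
    # The i-th record spans from its own offset to the next record's offset
    # (or to the end of the data for the last record).
    start = block.offsets[index]
    if index + 1 < len(block.offsets):
        end = block.offsets[index + 1]
    else:
        end = len(block.data)
    return block.data[start:end]
```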
Another important aspect is that, when the block is full (i.e. no additional record can be added to it), there will likely be a few bytes that remain empty (a number of bytes smaller than the size of the next record). Since the Block is then decoded starting from the end to get the number of records and all the offsets, it is important to know exactly how many bytes have been filled. That's why in create_block (the method that actually creates the Block Python object), I truncate the buffer down to the number of bytes that were actually filled.
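To make the "decode from the end" idea concrete, here is a sketch of what an encoding and its decoding could look like. The exact on-disk format isn't specified in this post, so the choices below (offsets and record count stored as 2-byte little-endian integers at the end of the block) are assumptions for illustration only:

```python
import struct

def encode_block(block: Block) -> bytes:
    # Assumed layout: [encoded records][offset_1 .. offset_n][n],
    # with each offset and the count n stored as uint16 little-endian.
    offsets_section = struct.pack(f"<{len(block.offsets)}H", *block.offsets)
    count_section = struct.pack("<H", len(block.offsets))
    return block.data + offsets_section + count_section

def decode_block(raw: bytes) -> Block:
    # Decode starting from the end: the last 2 bytes give the number of
    # records, the 2 * n bytes before them give the offsets, and everything
    # before that is the record data.
    (n,) = struct.unpack("<H", raw[-2:])
    offsets_start = len(raw) - 2 - 2 * n
    offsets = list(struct.unpack(f"<{n}H", raw[offsets_start:len(raw) - 2]))
    return Block(data=raw[:offsets_start], offsets=offsets)
```

With a layout like this, any leftover padding at the end of the buffer would be misread as offsets, which is exactly why the buffer has to be truncated to the bytes actually filled before the block is written out.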