RC: W6 D4 — Creating SSTables

March 21, 2024

Today, I paired with Jake to create SSTables (Sorted String Tables). Overall, the principle is very similar to that of creating blocks, and we thus implemented two distinct classes:

One SSTable is made of multiple data blocks, and we can use a format very similar to that of blocks:

+-----------------------+-------------------------------+-------+
|      Data Blocks      |              Meta             | Extra |
+-----------------------+-------------------------------+-------+
| DB1 | DB2 | ... | DBn | offset_DB1 | ... | offset_DBn | nb_DB |
+-----------------------+-------------------------------+-------+

The typical size of an SSTable is 256MB. Thus, the value of offsets can be up to 262,144,000. Thus, we need offsets to be stored as 32-bit (4-bytes) integers (contrary to record offsets in data blocks that fit in only 2-bytes integers).

When building the SSTable, we need to add the record to the correct data block. We can do this by trying to add it to the current block. In the case where there was no room left in the current block, we need to create the block (i.e. add the encoded record offsets to the raw records) and add this encoded block to the SSTable buffer. Once this is done, we can initialize the builder for the next block and add the record to it (since it was not added to the current block).

A few lines of code are better than a lengthy paragraph, so here is the implementation we came up with:

class Block:
    @property
    def size(self) -> int:
        pass

    def to_bytes(self) -> bytes:
        pass


class BlockBuilder:
    def __init__(self, block_size):
        pass

    def add(self, key, value) -> bool:
        pass

    def create_block(self) -> Block:
        pass


class SSTable:
    def __init__(self, data, offsets):
        pass


class SSTableBuilder:
    def __init__(self, sstable_size, block_size):
        self.block_size = block_size
        self.data_buffer = bytearray(sstable_size)
        self.data_block_offsets = []
        self.block_builder = BlockBuilder(block_size=self.block_size)
        self.current_buffer_position = 0

    def add(self, key, value):
        was_added_to_current_block = self.block_builder.add(key=key, value=value)

        # Nothing to do if the record was added to the block
        if was_added_to_current_block:
            return

        # Otherwise, finalize block and add it to the buffer
        block = self.block_builder.create_block()
        encoded_block = block.to_bytes()
        start = self.current_buffer_position
        end = self.current_buffer_position + block.size
        self.data_buffer[start:end] = encoded_block

        # Create a new block
        self.block_builder = BlockBuilder(block_size=self.block_size)
        self.data_block_offsets.append(self.current_buffer_position)
        self.current_buffer_position += block.size

        # Add record to the new block
        self.block_builder.add(key=key, value=value)

    def build(self):
        return SSTable(data=bytes(self.data_buffer[:self.current_buffer_position]),
                       offsets=self.data_block_offsets)

And there we have our SSTables built!