RC: W6 D4 — Creating SSTables
March 21, 2024Today, I paired with Jake to create SSTables (Sorted String Tables). Overall, the principle is very similar to that of creating blocks, and we thus implemented two distinct classes:
- one
SSTable
class that handles the encoding/decoding of SSTables - one
SSTableBuilder
class that handles the generation of SSTables
One SSTable is made of multiple data blocks, and we can use a format very similar to that of blocks:
+-----------------------+-------------------------------+-------+
| Data Blocks | Meta | Extra |
+-----------------------+-------------------------------+-------+
| DB1 | DB2 | ... | DBn | offset_DB1 | ... | offset_DBn | nb_DB |
+-----------------------+-------------------------------+-------+
The typical size of an SSTable is 256MB. Thus, the value of offsets can be up to 262,144,000. Thus, we need offsets to be stored as 32-bit (4-bytes) integers (contrary to record offsets in data blocks that fit in only 2-bytes integers).
When building the SSTable, we need to add the record to the correct data block. We can do this by trying to add it to the current block. In the case where there was no room left in the current block, we need to create the block (i.e. add the encoded record offsets to the raw records) and add this encoded block to the SSTable buffer. Once this is done, we can initialize the builder for the next block and add the record to it (since it was not added to the current block).
A few lines of code are better than a lengthy paragraph, so here is the implementation we came up with:
class Block:
@property
def size(self) -> int:
pass
def to_bytes(self) -> bytes:
pass
class BlockBuilder:
def __init__(self, block_size):
pass
def add(self, key, value) -> bool:
pass
def create_block(self) -> Block:
pass
class SSTable:
def __init__(self, data, offsets):
pass
class SSTableBuilder:
def __init__(self, sstable_size, block_size):
self.block_size = block_size
self.data_buffer = bytearray(sstable_size)
self.data_block_offsets = []
self.block_builder = BlockBuilder(block_size=self.block_size)
self.current_buffer_position = 0
def add(self, key, value):
was_added_to_current_block = self.block_builder.add(key=key, value=value)
# Nothing to do if the record was added to the block
if was_added_to_current_block:
return
# Otherwise, finalize block and add it to the buffer
block = self.block_builder.create_block()
encoded_block = block.to_bytes()
start = self.current_buffer_position
end = self.current_buffer_position + block.size
self.data_buffer[start:end] = encoded_block
# Create a new block
self.block_builder = BlockBuilder(block_size=self.block_size)
self.data_block_offsets.append(self.current_buffer_position)
self.current_buffer_position += block.size
# Add record to the new block
self.block_builder.add(key=key, value=value)
def build(self):
return SSTable(data=bytes(self.data_buffer[:self.current_buffer_position]),
offsets=self.data_block_offsets)
And there we have our SSTables built!