RC: W6 D2 — Generating Data Blocks
March 19, 2024

We know that blocks are supposed to be the size of an SSD page (cf. yesterday's post), and they are supposed to contain a certain number of records (that may vary between blocks, depending on the size of the records). Now the question is: how do we generate blocks so that they are never bigger than the expected block size?
I looked at the source code of RocksDB, LevelDB and mini-LSM, and they all have two classes related to generating blocks:
- one Block class that handles encoding/decoding data
- one BlockBuilder class that handles generating those blocks.
The BlockBuilder class contains a buffer whose size is the one expected for the block (64KB in my case) and an add method that adds a single record to the current block, provided that there is still room for it.
Here is what my implementation looks like:
```python
class Record:
    Key = str
    Value = bytes

    def __init__(self, key: Key, value: Value):
        self.key = key
        self.value = value

    def to_bytes(self) -> bytes:
        # Encoding of a single record into bytes (not shown here)
        pass


class Block:
    def __init__(self, data: bytes, offsets: list[int]):
        self.data = data
        self.offsets = offsets


class BlockBuilder:
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.offsets = []
        self.data_buffer = bytearray(self.block_size)
        self.data_length = 0

    def add(self, key: Record.Key, value: Record.Value) -> bool:
        encoded_record = Record(key=key, value=value).to_bytes()
        size = len(encoded_record)
        # Compute the new expected offset
        current_offset = self.data_length
        new_offset = current_offset + size
        # Do not add the record to the current block if it would exceed the block size
        if new_offset > self.block_size:
            return False
        # Otherwise (there is still room), add the record to the current block
        self.offsets.append(current_offset)
        self.data_buffer[current_offset:new_offset] = encoded_record
        self.data_length += size
        return True

    def create_block(self) -> Block:
        # Truncate the data to self.data_length (all bytes after it are empty)
        return Block(data=bytes(self.data_buffer[:self.data_length]),
                     offsets=self.offsets)
```
So the idea is: before adding a new record, we can compute the next expected offset. If it goes beyond the target size of the block, then we don’t add the record to that block. Otherwise, the record gets added to the current block (the encoded record is added to the buffer, and the current offset is added to the list of offsets).
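To make that flow concrete, here is a minimal sketch of how a caller could drive the builder. The build_blocks helper and its loop are my own illustration rather than part of the actual implementation, and it assumes no single record is larger than the block size:

```python
def build_blocks(records: list[Record], block_size: int = 64 * 1024) -> list[Block]:
    # Hypothetical driver: add records until the builder refuses one,
    # then freeze the current block and start a new builder.
    blocks = []
    builder = BlockBuilder(block_size=block_size)
    for record in records:
        if not builder.add(record.key, record.value):
            # Current block is full: create it, then retry the record
            # that did not fit in a fresh builder.
            blocks.append(builder.create_block())
            builder = BlockBuilder(block_size=block_size)
            builder.add(record.key, record.value)
    blocks.append(builder.create_block())
    return blocks
```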
All in all, the records are stored in the buffer in encoded form, and the list of offsets is not encoded. What's interesting about keeping records in their raw binary format is that it reduces memory overhead. If instead they were stored in a structure wrapping each key-value pair (like a list, for example), more memory would be needed to store the associated metadata (like the length and capacity of the list). To convince ourselves of that, we can compare the size of an empty stream of bytes with that of an empty list:
```python
import sys

# Empty stream of bytes
sys.getsizeof(b'')  # 33 bytes

# Empty list
sys.getsizeof([])   # 64 bytes
```
The list indeed takes up more space than the simple stream of bytes. Raw bytes may also be faster to process, given that bytes in a pre-allocated buffer are stored contiguously in memory, whereas a list in Python stores references to the actual elements. Thus, accessing the data in a list likely takes more time than reading a stream of bytes when we know where the starting offset is.
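Since the data is contiguous, reading a record back is just a slice between two offsets. The helper below is my own illustration (the Block class above doesn't define a read method); it assumes offsets[i] marks where the i-th encoded record starts:

```python
def record_bytes_at(block: Block, index: int) -> bytes:
    # The i-th record spans from its own offset to the next record's offset
    # (or to the end of the data for the last record).
    start = block.offsets[index]
    if index + 1 < len(block.offsets):
        end = block.offsets[index + 1]
    else:
        end = len(block.data)
    return block.data[start:end]
```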
Another important aspect is that, when the block is full (i.e. no additional record can be added to it), there will likely be a few bytes that remain empty (a number of bytes smaller than the size of the next record). Since the Block is then decoded starting from the end to get the number of records and all the offsets, it is important to know exactly how many bytes have been filled. That's why in create_block (the method that actually creates the Block Python object), I truncate the buffer down to the number of bytes that were actually filled.
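To make the "decode from the end" idea concrete, here is a sketch of what an encoding and its decoding could look like. The exact on-disk format isn't specified in this post, so the choices below (offsets and record count stored as 2-byte little-endian integers at the end of the block) are assumptions for illustration only:

```python
import struct

def encode_block(block: Block) -> bytes:
    # Assumed layout: [encoded records][offset_1 .. offset_n][n],
    # with each offset and the count n stored as uint16 little-endian.
    offsets_section = struct.pack(f"<{len(block.offsets)}H", *block.offsets)
    count_section = struct.pack("<H", len(block.offsets))
    return block.data + offsets_section + count_section

def decode_block(raw: bytes) -> Block:
    # Decode starting from the end: the last 2 bytes give the number of
    # records, the 2 * n bytes before them give the offsets, and everything
    # before that is the record data.
    (n,) = struct.unpack("<H", raw[-2:])
    offsets_start = len(raw) - 2 - 2 * n
    offsets = list(struct.unpack(f"<{n}H", raw[offsets_start:len(raw) - 2]))
    return Block(data=raw[:offsets_start], offsets=offsets)
```

With a layout like this, any leftover padding at the end of the buffer would be misread as offsets, which is exactly why the buffer has to be truncated to the bytes actually filled before the block is written out.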