RC: W9 D1 — Using event-driven patterns to rebuild the state from a manifest

April 8, 2024

The SSTables in an LSM engine are organized into several levels. This hierarchy of levels and the list of all SSTables that correspond to each is stored in the engine’s state, in memory only. So what would happen if the engine shut down? The SSTable files would still be present on disk, but there would be no way to know which belongs to each level any longer. To allow recovering the state at restart, LSM engines use a “manifest”: it is a file that logs all the operations that change the state. With this log, the operations can be replayed and the state recreated.

In my own implementation, so far, I have two operations that change it:

  1. The flush operation that writes a memtable to disk (thus creating an SSTable);
  2. The compaction operation that compacts and merges SSTables of one level to SSTables of the next level.

So every time one of these two operations is performed, I need to log it into the manifest. Later, when the system restarts, I will use the manifest to re-build the state.

Before implementing the code that would serialize events onto a manifest, I needed to design events that would capture all the information needed to allow to rebuild the state correctly. I paired with both Emily and Benjamin to do this. We wanted to write a function that would take a list of events and recreate the state from it. The principle is to replay the events one by one and to update the state as we go.

One easy way to do this would be to rewrite the full state to disk every time one of these events occur, but this would use a lot of disk space. To do it more efficiently, we wanted to design the events to contain only the minimal amount of information required to recreate it.

Flush events create a new SSTable on disk, and they will always be at level 0. So we only need to record the created SSTable:

class FlushEvent(Event):
    def __init__(self, sstable: SSTable):
        super().__init__()
        self.sstable = sstable

Compaction events are slightly more complex: they take n SSTables from a given level in input and recreate another set of m compacted SSTables positioned at the next level. We thus need to record these three pieces of information (level, input SSTables, output SSTables):

class CompactionEvent(Event):
    def __init__(self,
                 input_sstables: list[SSTable],
                 output_sstables: list[SSTable],
                 level: int):
        super().__init__()
        self.input_sstables = input_sstables
        self.output_sstables = output_sstables
        self.level = level

Next, we implemented a method that takes a list of such events as input and recreates the state:

from collections import deque


def reconstruct(self, nb_levels: int, events: list[Event]) -> LsmStorage:
    ss_tables_levels = [deque() for _ in range(nb_levels + 1)]

    for event in events:
        if isinstance(event, FlushEvent):
            ss_tables_levels[0].insert(0, event.sstable)
        if isinstance(event, CompactionEvent):
            level = event.level
            output_level = level + 1 if level < nb_levels else level
            for sstable in event.output_sstables:
                ss_tables_levels[output_level].insert(0, sstable)
            for sstable in event.input_sstables:
                ss_tables_levels[level].remove(sstable)

    store = LsmStorage()
    store.ss_tables = ss_tables_levels[0]
    store.ss_tables_levels = ss_tables_levels[1:]

    return store

This reconstruct method reads all events one by one and replays them to compute the list of SSTables at each level. The only special case is that of the last level: if we compact at this level, output files go to the same level (not the next one).

We are now able to recreate the state of the engine from a list of events. This is a great first step! Now that we know how to design the events, the next step will be to serialize them to disk, so that they can be written to and read from the manifest.