RC: W10 D1 — Writing to the manifest | Software under the hood

RC: W10 D1 — Writing to the manifest

April 15, 2024

There was one last thing left to do regarding the manifest: write to it whenever a new operation occurs. This is fairly easy to do, but there is nonetheless one aspect to think about: when exactly should the operation be reported in the manifest? Before occurring? After occurring? While occurring?

Here is how I approached the problem. The manifest is here so that the state can be recreated from the files stored on disk whenever the engine restarts. Therefore, it is important that all the files necessary to recover the state have been written before the corresponding operation is logged to the manifest. On the other hand, in case there is a crash before the operation has been logged onto the manifest, it is also important that the state can be recreated (even if that would be the second-to-last state). Therefore, the operation should be logged to the manifest before the files it is supposed to delete are effectively deleted. And that should do the trick.

Flush operations create a new SSTable from a memtable that is backed up by a write-ahead log (WAL). In this case, it would be wise to write to the manifest once the new SSTable has been created and added to the state, but before the WAL has been deleted so that the memtable can be recreated in case of a crash.

Here is the pseudocode for this operation:

class LsmStorage:
    def _flush(self) -> None:
        # Perform the flush operation building the new sstable
        new_sstable = ...

        # Update the state
        flushed_memtable = self.state.immutable_memtables.pop()
        self.state.sstables_level0.insert(0, new_sstable)

        # Write to manifest
        event = FlushEvent(sstable=new_sstable)
        self.manifest.add_event(event=event)

        # Delete the WAL
        flushed_memtable.wal.delete()

As for the compaction operation, the idea is similar: compaction takes a bunch of SSTables (input) and compacts them into new SSTables at the next level (output). Thus, the operation should be logged in the manifest after the output SSTables have been created and added to the state but before the input SSTables are deleted.

Here is the pseudocode for this operation:

class LsmStorage:
    def _compact(self, level: int) -> None:
        # Perform the compact operation building the new sstables
        sstables_to_compact = [...]
        new_sstables = [...]
        next_level = ...

        # Update the state
        self.state.sstables_levels[next_level].extendleft(reversed(new_sstables))
        for sstable in sstables_to_compact:
            self.state.sstables_levels[level].remove(sstable)

        # Write to manifest
        event = CompactionEvent(
            input_sstables=sstables_to_compact,
            output_sstables=new_sstables,
            level=level
        )
        self.manifest.add_event(event=event)

        # Delete the compacted sstables
        for sstable in sstables_to_compact:
            sstable.delete()

With this, everything should be safe: no matter when a crash happens, it should be possible to recreate the state.

That being said, I thought of two additional concerns while writing this post.

The first is: what happens if a crash happens while the operation is being logged to the manifest? The last event in the manifest would be only partially written and should thus be discarded. This is something that can be handled by adding a checksum to check the integrity of the record being read. Nevertheless, if the record is partially written, we could exclude it and recreate the previous state without any problem.

The second concern is: if there is indeed a crash, then some files may still be kept on disk (thus using disk space) while not necessary anymore. One way to deal with this would be to create a background process that searches for dangling SSTable files and deletes them when no state refers to them. As of now, I don’t think I will tackle this issue, but that would be a fun feature to implement.