RC: W11 D5 — Recovering the state of a transactional engine
April 26, 2024The transactional capability of my engine basically comes down to the fact that a monotonically increasing sequence number is associated to each write made in the database: this allows to maintain multiple versions of each record and to retrieve the state of the database at any historical point in time. Thus, new record versions can be written to the database while read queries can access the versions of a prior state, hence protecting against any interference between reads and writes. This approach is valid as long as every new write has a sequence number strictly greater than the previous one. As of now, this is guaranteed by the fact that the last committed sequence number is recorded in memory and updated after every new write. But what happens if the engine crashes or is restarted? In that case, it is important to be able to recover the last committed sequence number so that it can continue increasing monotonically.
To recover this value, it needs to be stored on disk. The most recent records are written to the memtable, which is backed by a write-ahead log (WAL). The WAL contains the records in insertion order and, since I previously added the sequence number to each encoded record, we can simply extract the last sequence number from the last record written to the WAL. So there is nothing more to do about it.
Remember though that when the engine is closed, all memtables are flushed to disk into SSTables. As a consequence, the state is recovered solely from SSTables, not from WALs/memtables. Since records are all written to SSTables, we could simply read all SSTables to find the biggest record, but that would take too long during recovery. The alternative approach is to record the maximum sequence number in each SSTable’s metadata. With this, we can only read the metadata to recover the last committed sequence number on the engine.
So that is what I did today: I updated the SSTables so that their metadata included this extra piece of information and added a method that allows to recover the last committed version from the bunch of SSTables. Here is the chunk of code allowing to recover:
class TransactionalLsmStorage(LsmStorage):
def __init__(self,
last_committed_sequence_number: int,
**kwargs,
):
super().__init__(**kwargs)
self.last_committed_sequence_number: int = last_committed_sequence_number
@classmethod
def reconstruct(cls, manifest_path: str) -> "TransactionalLsmStorage":
lsm_storage = LsmStorage.reconstruct(manifest_path=manifest_path)
last_committed_sequence_number = cls._get_last_committed_sequence_number(
state=lsm_storage.state
)
return cls(
configuration=lsm_storage.manifest.configuration,
directory=lsm_storage.directory,
state=lsm_storage.state,
manifest=lsm_storage.manifest,
last_committed_sequence_number=last_committed_sequence_number
)
@staticmethod
def _get_last_committed_sequence_number(state: LsmState) -> int:
max_sequence_number = -1
for level_ss_tables in [state.sstables_level0, *state.sstables_levels]:
for sstable in level_ss_tables:
max_sequence_number = max(
max_sequence_number,
sstable.max_sequence_number
)
return max_sequence_number
It is pretty straightforward: once the state has been reconstructed using what was previously done for the regular
engine (LsmStorage.reconstruct(manifest_path=manifest_path)
), the last committed sequence number can be
retrieved by checking the maximum committed sequence number in all SSTables (_get_last_committed_sequence_number
method) and that is all there is to it.
We can now fully recover a transactional LSM engine!