RC: W9 D5 — Implementing a Write-Ahead Log (WAL)

April 12, 2024

The manifest allows to recreate the state of the engine by replaying all compact and flush operations in order. This allows to retrieve all SSTables written to disk. However, if there is a crash before the data has been flushed to disk (i.e. when it is still in the memtables), then all that data is lost. Given that the engine may contain multiple memtables and those are typically up to 256MB large, that may sum up to a massive amount of data. In order to recover them, the typical solution is to use what is called a “write-ahead log” (WAL).

The WAL is a log file in which records are usually written before being added to the memtable. Therefore, if the database crashes, the data is not lost: it can be retrieved from the WAL. One important thing to note is that the records in the WAL are written in insertion order, and not in sorted order as in the SSTable file. This does not matter in any way, since the WAL is not meant to be read to look for records, but only there to recover the memtable in case of a crash.

Today, Ishan and I paired to implement WALs on my engine. Each memtable was assigned one WAL upon creation, and we modified the write path to log the record into the WAL before inserting it into the associated memtable. There is one trade-off we wondered about: should we flush the records written in the WAL right away or buffer a few before flushing them all at once? We decided to go for the former which is the best option to make sure we don’t lose any data.

However, for a use case where losing a few records is acceptable (for example, all those written during the last minute or hour), a better option would be to buffer them and flush them altogether as this would require less disk I/O. I will not implement it, but adding the possibility for the user to tune this trade-off themselves depending on their needs would definitely be an interesting feature.

Once we had one WAL per memtable, we proceeded to implement crash recovery: records in the WAL are decoded one by one until the full file is parsed and inserted each in a new memtable. With all this done, the engine can now recover from crashes!

There were a few more things to think about.

First, since the engine can hold multiple memtables at any given time, there may be multiple WALs present at the same time (they are deleted only when the memtable is flushed to disk). Each allows to recover one memtable. But how to know which corresponds to more recent memtables, and thus, how should recovered memtables be ordered in memory? We can use the WAL timestamp to order them: smaller timestamps correspond to older memtables.

Second, when we recover from a crash, the memtable may be almost full, completely empty or at any intermediate stage (on average: it should be half full). When we recover, even if one of the memtable was the current one and thus not full, it will be recovered as an immutable one. This simplifies things as there is no need to handle modifying/deleting the last line of the WAL in case it was partially written, and this does not create any particular problem.

Third, there may be cases where the crash happens once the memtable has been flushed into an SSTable but before the associated WAL was deleted. In that case, the recovered memtable should be identical to one SSTable. It does not matter much: the duplication will be removed after one round of flushing and compacting. Therefore, this edge case can be ignored. However, there should be some need to run a background process to find and delete all dangling WALs so that they are not used for recovery every time.

Last, the case where a crash occurs while the record is written to the WAL is not handled yet. I will see next week how to handle this.