RC: W10 D2 — Adding checksums to the WAL and the manifest

April 16, 2024

Now that the manifest and WALs (write-ahead logs) have been implemented, my engine is able to recreate its state when it is restarted. But what happens if the engine crashes while data is being inserted? Or while an operation (like flushing or compaction) is being performed? In that case, the data being written may end up corrupted, and the engine may later serve corrupted records. It is therefore important to add a mechanism that detects such corruption, so that corrupted data is never served. The typical way this issue is tackled is by adding checksums to critical data. But which data is likely to be corrupted and should thus be double-checked with a checksum?

First, a crash may occur while a new record is being written to the WAL. If that happens, the last record should be excluded, and the memtable recovered up to the penultimate insertion. So in the worst-case scenario, a single record in the whole engine is lost.

In the case of WALs, we can compute a checksum for each record before it is written to the WAL, and prepend the checksum to the record. Then, when reading the WAL, the checksum is read together with the record. Before considering the record valid, the checksum of the record just read is re-computed and compared to the expected checksum. If they are identical, we can proceed to the next record. Otherwise, the record (and everything after it) is excluded from the WAL recovery.
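
Concretely, the checksum can be something as simple as a CRC32. Here is a minimal sketch of a Checksum class exposing the interface used by the manifest code below (CRC32 and the 4-byte big-endian encoding are just one possible choice):

import zlib


class Checksum:
    """A CRC32 checksum stored on 4 big-endian bytes."""

    nb_bytes = 4

    def __init__(self, data: bytes = b"", value: int | None = None):
        # Either compute the checksum of `data` or wrap an already-decoded value
        self.value = value if value is not None else zlib.crc32(data)

    def to_bytes(self) -> bytes:
        return self.value.to_bytes(Checksum.nb_bytes, byteorder="big")

    @classmethod
    def from_bytes(cls, data: bytes) -> "Checksum":
        return cls(value=int.from_bytes(data, byteorder="big"))

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Checksum) and self.value == other.value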

Second, a crash may happen while an internal operation (a flush or a compaction) is in progress. Given that the operation is recorded in the manifest after the new SSTable files have been created but before the old ones are deleted, most problematic cases are covered: we can recover to either the previous or the current state if a crash happens before or after writing to the manifest. However, there is still no safety if a crash happens while the operation itself is being recorded in the manifest. In that case, the best behavior is to ignore the last operation and recover to the penultimate state. To do that, I also added a checksum to each operation recorded in the manifest, and I consider an operation only if the re-computed checksum matches the expected one.

To give an idea, here is the code for the manifest (the logic is very similar for the WAL):

import logging


class ManifestFile:
    def write_event(self, event: Event):
        encoded_record = ManifestRecord(event=event).to_bytes()
        encoded_checksum = Checksum(data=encoded_record).to_bytes()

        # Write checksum first and record second
        self.file.write(encoded_checksum)
        self.file.write(encoded_record)
        self.file.flush()

    @staticmethod
    def decode_events(data: bytes) -> list[Event]:
        events = []
        while data:
            checksum_size = Checksum.nb_bytes

            # Stop if the remaining bytes cannot even contain a full checksum
            # (e.g. the file was truncated by a crash mid-write)
            if len(data) < checksum_size:
                break

            # Read checksum and record
            expected_checksum = Checksum.from_bytes(data=data[:checksum_size])
            manifest_record = ManifestRecord.from_bytes(data=data[checksum_size:])

            # Ignore the rest of the manifest if the record is corrupted
            record_checksum = Checksum(data=manifest_record.to_bytes())
            if record_checksum != expected_checksum:
                logging.info("Record corrupted. Ignoring the rest of the manifest")
                break

            # Proceed if record is not corrupted
            event = manifest_record.event
            checkpoint = manifest_record.size + checksum_size  # bytes consumed by this entry
            events.append(event)
            data = data[checkpoint:]
        return events
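
As a quick sanity check of the detection logic (using the Checksum sketch from above; the record bytes here are made up):

record = b"flush:memtable-42"
checksum = Checksum(data=record)

# A clean read: the re-computed checksum matches the stored one
assert Checksum(data=record) == checksum

# Simulate a torn write by flipping the last byte of the record
corrupted = record[:-1] + bytes([record[-1] ^ 0xFF])
assert Checksum(data=corrupted) != checksum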

There is yet another case that RocksDB handles: it adds a checksum to each data block being written. This is a safety net in case something goes wrong when actually writing the data to disk. When that happens, RocksDB marks the data block as corrupted and stops serving requests from it; the non-corrupted data for the block can later be recovered from a backup, for example. In my case, I decided not to implement this, as I don’t have backups to recover from. But I will keep it in mind for the future if I decide to push the project further.
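
To illustrate the idea nonetheless, here is a rough sketch of what per-block verification could look like on the read path (SSTableReader, its block layout, and the corrupted-block bookkeeping are hypothetical and reuse the Checksum sketch from above; this is not RocksDB’s actual code):

class SSTableReader:
    def __init__(self, blocks: list[bytes]):
        # Each stored block is laid out as: 4-byte checksum + payload
        self.blocks = blocks
        self.corrupted: set[int] = set()

    def read_block(self, index: int) -> bytes | None:
        if index in self.corrupted:
            return None  # corrupted block: stop serving requests from it
        block = self.blocks[index]
        expected_checksum = Checksum.from_bytes(data=block[:Checksum.nb_bytes])
        payload = block[Checksum.nb_bytes:]
        if Checksum(data=payload) != expected_checksum:
            self.corrupted.add(index)  # mark the block as corrupted
            return None
        return payload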