RC: W4 D3 — Freezing memtables | Software under the hood

RC: W4 D3 — Freezing memtables

March 6, 2024

Memtables, the in-memory components of LSM-trees, keep the records sorted in-memory. But they cannot grow indefinitely: they need to be flushed to disk at some point so that the data can be persisted.

However, the process of writing to disk is an I/O operation that can take some time and, ideally, the storage engine should carry on accepting new reads and writes in the meantime. To achieve this, we “freeze” the active memtable (i.e. it becomes immutable and does not accept any new writes), and a new active memtable takes over the new writes. Later, the frozen one will be flushed to disk.

So far, the principle sounds easy, but implementing it in practice gets much harder if we want to allow concurrent writes and avoid race conditions. After a whole day of trying to figure this out, I think that I finally understand how it can (should?) be done. For reference: I spent a lot of time today reading the code for freezing of the mini-LSM repo and this really helped me understand it, so I would recommend this reference to anybody interested in digging this further. Also, I definitely need to thank Alex for patiently helping me figure out the bits that I had not quite understood (and would not have without him).

So… During the freeze process, we want to:

Guarantee that we don’t lose or corrupt any data (consistency)
Impact performance as little as possible (high read/write throughput)

To achieve this, we can create a .try_freeze() method that will check, at every insert, whether the freeze operation should occur. If it is the case, then we can effectively freeze the memtable (execute .do_freeze()). Here is how I implemented it:

import threading

THRESHOLD = 1500000  # 1.5 MB (or any other value) 


class LsmStorage:
    def __init__(self):
        self.memtable_approximate_size = 0
        self._freeze_lock = Mutex()
        self._state_lock = ReadWriteLock()
        # NB: Implementation of the Mutex and ReadWriteLock in the previous post

    def insert(self, key, value):
        # first perform insertion
        # ...
        # then compute new size
        self.memtable_approximate_size += len(key) + len(value)

        # then check if a freeze operation should occur
        self.try_freeze()

    def try_freeze(self):
        with self._state_lock.read():
            approximate_size = self.memtable_approximate_size

        if approximate_size >= THRESHOLD:
            with self._freeze_lock:
                with self._state_lock.read():
                    latest_approximate_size = self.memtable_approximate_size
                if latest_approximate_size >= THRESHOLD:
                    self.do_freeze()

    def do_freeze(self):
        pass

Here is what happens in the .try_freeze() method:

First, it checks whether the memtable has reached the THRESHOLD size.
If that is the case, then it acquires the freeze lock and checks a second time if the size of the memtable is still above the THRESHOLD
If that is the case, then it executes the freeze operation.

This strategy guarantees that:

Only one freeze operation can occur at any given time (thanks to the Mutex lock that grants exclusive access on the freeze operation).
We have good performance: because we acquire the freeze lock only after checking the conditions for freezing are met, we are guaranteed that it is acquired only when absolutely necessary and thus does not prevent other kinds of operations which might need acquiring this lock.
The freeze operation does not occur on empty memtables: this is why the condition is checked a second time (once the freeze lock has been acquired). Indeed, while waiting to be granted the freeze lock, another operation might have been freezing the memtable and release its lock afterward. Thus, when we are granted the lock, we need to check that the condition is still met.

Also, on a side note, every time we need to read the size of the memtable, we acquire the read lock on the state. Why bother with a second locking mechanism ? The reason to do this is to separate concerns: the mutex self._freeze_lock is for freeze operations, the ReadWriteLock self.state_lock is for reading/writing on the state. It might seem tedious at first to have two different locks, but it should actually make things easier to maintain in the long term, as we will know exactly which lock to acquire for each kind of operation.