RC: W4 D3 — Freezing memtables

March 6, 2024

Memtables, the in-memory components of LSM-trees, keep the records sorted in-memory. But they cannot grow indefinitely: they need to be flushed to disk at some point so that the data can be persisted.

However, the process of writing to disk is an I/O operation that can take some time and, ideally, the storage engine should carry on accepting new reads and writes in the meantime. To achieve this, we “freeze” the active memtable (i.e. it becomes immutable and does not accept any new writes), and a new active memtable takes over the new writes. Later, the frozen one will be flushed to disk.

So far, the principle sounds easy, but implementing it in practice gets much harder if we want to allow concurrent writes and avoid race conditions. After a whole day of trying to figure this out, I think that I finally understand how it can (should?) be done. For reference: I spent a lot of time today reading the code for freezing of the mini-LSM repo and this really helped me understand it, so I would recommend this reference to anybody interested in digging this further. Also, I definitely need to thank Alex for patiently helping me figure out the bits that I had not quite understood (and would not have without him).

So… During the freeze process, we want to:

  1. Guarantee that we don’t lose or corrupt any data (consistency)
  2. Impact performance as little as possible (high read/write throughput)

To achieve this, we can create a .try_freeze() method that will check, at every insert, whether the freeze operation should occur. If it is the case, then we can effectively freeze the memtable (execute .do_freeze()). Here is how I implemented it:

import threading

THRESHOLD = 1500000  # 1.5 MB (or any other value) 


class LsmStorage:
    def __init__(self):
        self.memtable_approximate_size = 0
        self._freeze_lock = Mutex()
        self._state_lock = ReadWriteLock()
        # NB: Implementation of the Mutex and ReadWriteLock in the previous post

    def insert(self, key, value):
        # first perform insertion
        # ...
        # then compute new size
        self.memtable_approximate_size += len(key) + len(value)

        # then check if a freeze operation should occur
        self.try_freeze()

    def try_freeze(self):
        with self._state_lock.read():
            approximate_size = self.memtable_approximate_size

        if approximate_size >= THRESHOLD:
            with self._freeze_lock:
                with self._state_lock.read():
                    latest_approximate_size = self.memtable_approximate_size
                if latest_approximate_size >= THRESHOLD:
                    self.do_freeze()

    def do_freeze(self):
        pass

Here is what happens in the .try_freeze() method:

This strategy guarantees that:

Also, on a side note, every time we need to read the size of the memtable, we acquire the read lock on the state. Why bother with a second locking mechanism ? The reason to do this is to separate concerns: the mutex self._freeze_lock is for freeze operations, the ReadWriteLock self.state_lock is for reading/writing on the state. It might seem tedious at first to have two different locks, but it should actually make things easier to maintain in the long term, as we will know exactly which lock to acquire for each kind of operation.