RC: W10 D4 — Overview of isolation in RocksDB

April 18, 2024

So far, my engine used a read-write lock to ensure consistency of both read and write operations. However, a read-write lock limits concurrency tremendously as readers block writers and writers block readers. To enable higher concurrency, RocksDB uses another strategy that allow readers not to block writers and vice-versa: snapshots.

A snapshot of the database provides a consistent view of the data it contains at one particular point in time. So, if instead of reading the latest data, we read from a slightly older point in time, it is possible to read while writing: while the writer is writing the newest data (which is partially written until the writer has effectively finished), a reader can read the data from the snapshot taken right before the writer started writing. This way, readers access data that is always consistent even if some data is concurrently being written.

For example, if we insert (“key1”, “value”) and later rewrite it to (“key1”, “updated_value”), we should read:

“value” if the read request was made before the second write is committed;
“updated_value” if the read request was made after the second write is committed.

Here is a visual representation of the idea (inspired from Martin Kleppmann’s book “Designing Data-Intensive Applications”):

         Writer1                             Writer2
    ("key1", "value")               ("key1", "updated_value") 
· · ----------------- · · · · · · · · · ----------------- · · · · · · · · · 
                        ^                     ^                     ^
                        |                     |                     |
                      Reader                Reader                Reader           
                     "value"               "value"            "updated_value"              


Legend:
----- = duration of the write commit
· · · = duration where there is no write commit

The way RocksDB implements this can be found in their wiki at the Transactions and Snapshot pages. In a nutshell, a monotonically increasing sequence number is assigned to each record that is written. When making a read request, it takes a snapshot of the database that displays only the records whose sequence number is smaller than or equal to the most recently committed record. In other words, every record that is being written but not committed is ignored by the readers. This way, we can have both readers and writers without creating any interference. Also, they have two methods for inserting (both are described on their wiki):

an optimistic approach that allows concurrent writes and detects conflicts at commit time;
a pessimistic approach that uses mutex locks on writes to avoid conflicts.

Keeping multiple versions and allowing to read from older versions necessarily also has impacts on other operations that run on the database (compaction, flush and freeze). I have yet to figure out the specific details on these, but as far as my understanding goes for now: before deleting compacted SSTables and flushed memtables, it will be necessary to check that they are not currently being read.

So, with all of this understanding in mind, here is what I am planning to do for the upcoming days in order to implement this feature:

Add a monotonically increasing sequence number to every record (both when stored in memory and when written to disk)
Update the read path to read from a given snapshot
Update the write path to keep the last committed sequence number in memory
Update compaction, flush and freeze operations so that they don’t delete data in a snapshot
Handle recovery to record the last committed sequence number.

The plan may change as I figure the bits out, but that is what I am planning for now.