RC: W12 D1 — Compaction with multiple versions

April 29, 2024

Compaction allows to reclaim disk space: obsolete records are discarded during this operation. Before allowing my engine to store multiple historical versions, I simply kept only the most recent record version for each key and discarded all others. However, now that my engine is able to store multiple versions, and that read requests access versions from a historical state, things get a little more complex. Indeed, versions that are currently being accessed by a read request must not be deleted (at least not until nobody else needs them). But how to know which versions should be kept?

Here is how we can think about it. A read query takes a snapshot of the database and reads from that snapshot. Therefore, any version that should be visible in that snapshot should not be removed. A snapshot allows to see the database at the state in which it was at this historical point in time. Thus, for a given snapshot, one single version can be accessed for each key: namely, the version that has the highest sequence number among all versions whose sequence number is equal or smaller than the snapshots. So we know that we want to keep at least this version. Hereunder is a schema that recaps the reasoning:

              Snapshot
                 |
                 |
                 V
        
KeyA   A1        A2    A3
KeyB       B1 B2 ^         B3   B4
KeyC     C1   ^  |   C2
         ^    |  |   <------------>
         |    |  |   After snapshot 
         |    |  |   => not visible
         |    |  |
         |    |  |
         C1, B2, A2
        are selected
      (latest versions 
       before snapshot) 

Second, read queries use the last committed sequence number as the snapshot, and this number keeps on increasing since, by definition, it is monotonically increasing. Therefore, all the versions that are above the snapshot may be used by future queries, which is why these should not be deleted either.

Third, it is possible that multiple read queries (with identical or different snapshots) could occur concurrently. To avoid any problem, the easiest decision is to be the most conservative and consider the oldest snapshot.

All in all thus, we need to:

Record all the snapshots that are being used by read queries at any given point in time;
Select the oldest snapshot (the one with the smallest value);
Keep all versions that are greater than this snapshot as well as the last version that is below or equal to this snapshot;
All older versions may be discarded.

To implement this strategy, I did 2 things. First, I recorded all snapshots currently used in memory: whenever a get or scan query starts, the snapshot taken is added to that collection, and it is removed from it when the read query terminates. I decided to store these in a set, as both insertions and removals in a set occur in constant time (o(1)). The drawback is that finding the minimum value takes o(n) time. However, given the fact that insertions and deletions should occur dramatically more often than compaction (which requires to find the minimum), this seemed like a relevant trade-off. Second, when running compaction on a transactional engine, this snapshot is used to define which versions are kept or not (following the rules defined above).

Compaction now works on the transactional engine: with this feature, I can avoid running out of disk space, while still maintaining multiple versions!