RC: W8 D3 — Compacting SSTables
April 3, 2024

So far, my LSM engine is able to flush memtables into SSTables. Since records in memtables are sorted and each memtable gives birth to one SSTable, records within each individual SSTable are sorted too. However, they are not sorted across SSTables. That is to say: if I select two SSTables, they are very likely to have overlapping records. Therefore, when searching for a given record, I would need to read all SSTables one by one, starting from the most recently written and going to the oldest. This is what is called “read amplification”. Formally, read amplification is defined as the number of disk reads per query (as defined in the RocksDB tuning guide).
To reduce read amplification, we can perform “compaction” of the SSTables. Compaction consists of merging those files together while removing obsolete records (those that have been overwritten since). Since SSTables are already sorted, we can in fact apply a regular merge-sort-style merge. This is efficient space-wise since, for each SSTable being merged, we only need to read and hold a single record in memory. In other words, the space complexity is only \(O(n)\) (for \(n\) SSTables) instead of \(O(nm)\) if each file had \(m\) records on average and was not sorted.
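To make this concrete, here is a minimal Python sketch of the merge (illustrative rather than my actual engine code; it assumes each SSTable can be iterated as sorted `(key, value)` pairs):

```python
import heapq

def merge_sstables(sstables):
    """Merge sorted (key, value) iterators into one de-duplicated stream.

    `sstables` must be passed newest-first: heapq.merge is stable, so for
    equal keys the entry from the earliest iterable (the newest SSTable)
    comes out first, and later duplicates can be skipped as obsolete.
    Only one record per input SSTable is buffered at any time, hence the
    O(n) space for n SSTables.
    """
    previous_key = None
    for key, value in heapq.merge(*sstables, key=lambda record: record[0]):
        if key == previous_key:
            continue  # an older, overwritten version of the same key
        previous_key = key
        yield key, value
```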
Implementing this was rather straightforward: I could use the merge iterator to iterate over all records of the SSTables and rewrite them, in fully sorted order, into a new set of SSTables. The only thing I needed to keep in mind was to create SSTables that don’t exceed their maximum size. Therefore, when merging \(n\) SSTables, at most \(n\) new SSTables are generated (fewer if some records have been overwritten), but there should never be more.
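Continuing the sketch, splitting the merged stream into size-capped files could look like this (`open_writer` and the writer’s `size`, `add` and `seal` are made-up stand-ins for whatever the file-writing layer provides):

```python
def write_compacted(records, max_sstable_size, open_writer):
    """Write a merged record stream into SSTables capped at a maximum size.

    The current file is sealed as soon as the next record would push it
    past `max_sstable_size`, so merging n SSTables should not produce
    more than n output files in practice.
    """
    outputs = []
    writer = open_writer()
    for key, value in records:
        if writer.size + len(key) + len(value) > max_sstable_size:
            writer.seal()
            outputs.append(writer)
            writer = open_writer()
        writer.add(key, value)
    writer.seal()
    outputs.append(writer)
    return outputs
```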
With this, two distinct levels of SSTables are emerging:
- The first level is the one corresponding to the direct flush of each immutable memtable to disk: this is “level 0”. Each individual SSTable is sorted, but SSTables on this level may overlap with one another (in other words, the records are not sorted across SSTables at this level).
- The second level contains the results of the compaction of level-0 SSTables: this is “level 1”. At this level, SSTables do not overlap with one another. Therefore, when searching for a given key in this level, we only need to read a single file.
In practice, LSM engines usually have more than 2 levels (as of now, I don’t really know how the number of levels should be configured; that is something I will have to look up at some point). However, all the other levels share the same characteristic as level 1: no SSTable overlaps with another from the same level. This means that, when looking for a given record, we only need to read a single SSTable file per level, so the read amplification becomes pretty low.
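To illustrate, a point lookup across levels might go something like the following sketch (the per-table `get` and precomputed `first_key` are assumed helpers, and level 0 is assumed to be stored oldest-first since flushes append to it):

```python
import bisect

def get(key, levels):
    """Point lookup across levels; `levels[0]` is level 0, and so on.

    Level 0 must be scanned newest-to-oldest because its SSTables may
    overlap. On deeper levels the SSTables are disjoint, so a binary
    search over their smallest keys pinpoints the single candidate file.
    """
    for sstable in reversed(levels[0]):  # newest flush first
        value = sstable.get(key)         # assumed per-table point lookup
        if value is not None:
            return value
    for level in levels[1:]:
        first_keys = [t.first_key for t in level]  # assumed per-table min key
        i = bisect.bisect_right(first_keys, key) - 1
        if i >= 0:
            value = level[i].get(key)
            if value is not None:
                return value
    return None
```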
There are multiple strategies for compaction (for an overview, refer to the RocksDB documentation on compaction). I decided to implement only one: “Classic Leveled” compaction, which is the one described in the original LSM-tree paper. The principle is easy to understand: all SSTables of a given level are compacted and merged into a new set of SSTables positioned at the next level. This made the implementation relatively straightforward.
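Building on the earlier sketches, one such compaction step could look like this. Note an assumption the sketch makes: the tables already sitting at the next level join the merge too, since otherwise the freshly written files could overlap with them and break the level’s no-overlap invariant.

```python
def compact_level(levels, n, max_sstable_size, open_writer):
    """Classic leveled step: fold all of level n into level n + 1.

    Level-n tables are passed before level-(n + 1) tables (and
    newest-first within level 0) so that merge_sstables keeps the most
    recent version of each key.
    """
    inputs = list(reversed(levels[n])) + list(levels[n + 1])
    merged = merge_sstables(inputs)
    levels[n + 1] = write_compacted(merged, max_sstable_size, open_writer)
    levels[n] = []
```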
A more interesting matter is the way we can think about concurrency for this operation. Compacting takes a long time, since we need to 1) read all SSTables of one level, 2) iterate over all records and 3) create a new set of compacted SSTables. Therefore, it would be wise to avoid holding a lock during all that time. One way to do this is to run the compaction as a side operation, and to modify the state only once the new SSTable files have been fully generated. This way, we can minimize the duration during which the compaction process holds the lock. On the other hand, while the state is being modified (i.e. when SSTables are removed from one level and added to another one), we’d better have no interference from other processes that may change the same levels. This includes both other compaction operations and flush operations (which add SSTables to level 0). Therefore, I used the same mutex lock as for the flush operations to handle state changes due to compaction.

As a side note, according to [their wiki](https://github.com/facebook/rocksdb/wiki/Leveled-Compaction), RocksDB is able to have multiple compaction operations execute concurrently, as long as they don’t apply to the same SSTables. That would definitely be better, but for my implementation, I decided that a single compaction operation running at a time would be good enough for now.
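In the same sketch style, the locking pattern could look like this (`db` is a hypothetical stand-in bundling the levels, the size cap, the writer factory, and the same lock the flush path acquires; it reuses the helpers from above):

```python
def compact_in_background(db, n):
    """Merge without the lock held; take it only to publish the result.

    The input tables are snapshotted under the lock, the long
    merge-and-write phase runs lock-free, and the lock is re-taken at the
    end to swap the state. A flush may append SSTables to level 0 in the
    meantime, so only the snapshotted tables are removed, not whatever
    arrived since. A single compaction runs at a time, so nothing else
    touches level n + 1 while this executes.
    """
    with db.lock:
        upper = list(db.levels[n])
        lower = list(db.levels[n + 1])

    # Long-running phase: file reads, merging, file writes; no lock held.
    merged = merge_sstables(list(reversed(upper)) + lower)
    new_tables = write_compacted(merged, db.max_sstable_size, db.open_writer)

    with db.lock:
        db.levels[n] = [t for t in db.levels[n] if t not in upper]
        db.levels[n + 1] = new_tables
```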