RC: W8 D4 — Triggering compaction | Software under the hood

RC: W8 D4 — Triggering compaction

April 4, 2024

I am now able to compact a full level of SSTables into the next level. The remaining question is: when should we trigger compaction ?

As far as I know, the answer to this question was first given in the original LSM-tree paper. This paper introduced the concept of LSM-trees as an alternative to traditional B-trees for environments with large volumes of write operations. The main goal of LSM-trees that is highlighted in the paper is to reduce the disk I/O required for large volumes of insert operations.

Naturally thus, when talking about compaction (which they refer to as “rolling merge processes” in the paper), the focus is put on minimizing the I/O rate (i.e. the I/O operations used either for reading from disk to memory or for writing merged pages to disk). SSTables are distributed over multiple levels of the LSM-tree (labeled 0 to K) and, to define the optimal parameters, the authors of the paper define what they call the “size ratio between adjacent pairs of components”. This size ratio corresponds to the ratio of the total size (in bytes) of all SSTables on a given level over that of the previous one:

\[r_i = {S_{i+1} \over S_{i}}\]

They demonstrate that the I/O rate is minimized when the ratios for all levels are all equal to a common value \(r\). For anybody interested to understand why that is, they provide a proof for this right after the stated theorem (theorem 3.1 in the paper).

Nevertheless, we can get an intuition as to why that is true:

A smaller \(r_i\) means that two adjacent levels are closer in size, which leads to more frequent merges. More frequent merges (i.e. more compaction processes) induces a higher write amplification and thus more I/O operations.
A higher \(r_i\) means that level \(i + 1\) is much bigger than level \(i\). Therefore, when compaction is triggered at this level, the volume of data to merge is big, thus resulting in both increased disk reads (to get all the data that needs to be merged) and increased disk writes (to write that data back onto disk). Also, since more data accumulates to that level, compaction will be triggered less frequently, thus possibly leading to long periods without any compaction interspaced with bursts of I/O operations.

Thus, to minimize the number of I/O operations, the optimal solution is to have a constant \(r\) value across all levels.

From what I have read, the typical value usually selected for this common \(r\) revolves around 10-20 (i.e. each level has 10 to 20 times the capacity of the previous level). Once this initial value is set, read and write amplification can be monitored (for example, RocksDB provide a bunch of statistics on compaction) and the value for \(r\) can be fine-tuned depending on how the system fares under the workload it is experiencing.

Whenever a level reaches a size that is \(r\) times bigger than that of the previous one, compaction is triggered for that level.