RC: W8 D2 — Scanning SSTables for range queries
April 2, 2024When executing a range query (i.e. a query that fetches all records whose key falls within the specified range), RocksDB does not load all the items into memory at once. Instead, it uses an iterator to traverse through the records lazily (i.e. loading them as needed). This pattern allows to manage resources efficiently when dealing with large datasets and is common across database management systems.
Since I now have specific iterators for each component (cf yesterday’s post), it should be easy to implement the scan:
essentially, the only thing to do is to provide the boundaries to the iterator (start
and end
keys) so that it
iterates only over this range.
I already did that for memtables some time ago and only needed to move the corresponding code to the memtable iterator.
However, for on-disk components, it still needed to be implemented and that is what I worked on today.
To iterate over a specified range, we need to position the iterator at the first item in the range, and then keep on iterating until we get to the last one. Since SSTables are broken down into multiple data blocks, I thus needed to find the first block that might contain the searched key. I can do that by iterating over all meta blocks (that summarize the first and last key of each data block) until I find one whose first and last keys are respectively lesser and greater than the searched one.
Here is a visual representation of the process:
FK1 LK1 FKi LKi FKn LKn
+------------------+---------+------------------+---------+------------------+
| Data Block 1 | ... | Data Block i | ... | Data Block n |
+------------------+---------+------------------+---------+------------------+
(FK = first key, LK = last key)
Look at all sets of keys until they contain `start key`:
- FK1 < LK1 < Start key
- FK2 < LK2 < Start key
- ...
- FKi < Start key < LKi
=> The first Data Block to read is Data Block i
Once this starting block is found, we can iterate on this block and on all the following ones until we find a key
greater than the end
boundary.
Within the first data block, we need to position the cursor at the start
key (not at the beginning of the block!).
Thus, I implemented a binary search to find the index of the first key that is equal or above the start
key.
In practice, this implied reading the chunk corresponding to offsets to store them in memory and, at each iteration,
reading the record corresponding to the selected offset to get its key and compare it to the searched one.
I now have a way to run range queries on each SSTable!
That’s a good step, but this is not completely over: executing the scan on the LSM engine requires merging the scans on all memtables and all SSTables. Previously, when I designed my memtables, I had decided to store the records formatted in the same way as they would be in the SSTables. When inserting or reading from a memtable alone, this requires a bit more work because I need to encode and decode my records each time. However, at this stage, this choice became particularly convenient: merging records coming from the memtables and the SSTables requires no extra work since they are already in the same format.