RC: W6 D3 — Why RocksDB is optimized for SSDs

March 20, 2024

The first thing mentioned in the RocksDB wiki is that it is optimized for Flash storage (i.e. the storage technology used by most SSDs) but they don’t explain why specifically. I have been wondering about that ever since I started working on my LSM, and I think I understood one of the reasons for that today and I thought I would thus share this.

To understand it, we need to understand how SSDs work. The smallest write unit on SSDs is a page (which is 4 to 16KB long usually). This means that every time we write anything (even if it is as small as one single byte), a full page gets written (including many empty bytes).

When flushing memtables into SSTables blocks, all records are added into a buffer, and the block is written to disk only once the buffer is full. If we did not use a buffer and wrote all records one by one, there would obviously be many more I/O operations. But on top of that, records would be scattered across multiple pages. At some point, the SSD’s garbage collector would need to erase all blocks containing those pages, and rewrite a new more condensed page containing all of these records. This would mean higher write amplification (records need to be written once and then re-written a second time to condense them). This would also mean higher CPU usage for the garbage collector. Instead, by first putting all records in a buffer, and flushing them to disk only once the block is full, we need only one I/O operation and most importantly, the data is already condensed, thus preventing write amplification.

I am pretty sure there are some more optimizations linked to SSTables compaction. I will explain those when I get there!