RC: W2 D5 — How Bitcask allows fast boots up

February 23, 2024

Today, I worked on the boot up process for my Bitcask prototype so here is the gist of it. When opening a Bitcask datastore, a bunch of datafiles may already be present (if you open an already existing datastore). So, at boot up, I need to instantiate and fill in the in-memory hash index (“keydir”) with the file and position where the correct value for each key can be retrieved.

One way to do this is to read all datafiles from oldest to newest and fill in the keydir in the meanwhile. That would work, but may take long because datafiles may be huge (especially if values are big chunks of bytes, like images or videos or whatnot).

To speed up the boot up process, the Bitcask team came up with the idea to add “hint files” close to the datafiles. More precisely, they add a hint file every time they create a new merged file.

Each hint file item contains the following:

+-----------+----------+------------+----------------+----------------+  
| timestamp | key_size | value_size | value_position |      key       |
+-----------+----------+------------+----------------+----------------+
|  32 bits  | 32 bits  |  32 bits   |    32 bits     |  key_size bits |
+-----------+----------+------------+----------------+----------------+

This content is fairly similar to what needs to be stored in the keydir. Indeed, the keydir needs, for each key:

Therefore, instead of reading the full datafiles, it only needs to read the hint files. (This also means that we need to create those hint files every time we create a new merged file.)

Note that since the hint files are created only when merging, the boot up process will also need to parse the datafiles that have not yet been merged (but there should not be too many).

So I implemented all of this today: creating the hint files during the merge process + reading from the hint files + remaining datafiles at boot up to create the keydir. Now my Bitcask prototype can boot up fast! 🍾