RC: W11 D1 — Ordering sequence numbers with big-endianness and bitwise NOT

April 22, 2024

The sequence number associated to each record allows to stamp a particular version of the record. Indeed, if two distinct values for the same key have been inserted in the database, the one inserted first will have a sequence number smaller than that of the one inserted second (since sequence numbers are monotonically increasing). If all the different versions are stored in the database (either in the memtables or SSTables), it becomes possible to access historical records by specifying a particular sequence number.

In my implementation, the memtable is a wrapper around a red-black tree. Therefore, if I want to access the versions of records at a particular point in time, I need to store both the record key and its sequence number as the key of each red-black tree node in order to use both pieces of information for lookups. Originally I was thinking about updating my red-black tree to store a tuple (key, sequence number) as each node’s key. But Evance whom I was pairing with advised me to convert node keys as bytes so that they would become agnostic of their content. With this, the memtable could be responsible for decoding the node’s key and value to select the correct versions of the record.

We thus did a very big refactoring to have the red-black tree handle keys as bytes instead of strings and integers (a good 50% of the 250 unit tests were broken because of this, but having them was of huge help!).

Once that was done, I needed to encode the sequence number together with the key. For the rest of this post, I will call an encoded key + sequence number a “snapshot key”. However, there is one important constraint to take into account: the ordering of these snapshot keys. More specifically:

For example, the following three records should be ordered as such: \((keyA, 10) < (keyA, 3) < (keyB, 8)\)

To comply with these requirements, I decided to :

  1. Concatenate the encoded key with the encoded sequence number, so that the key is the main ordering factor;
  2. Encode the sequence number using big-endianness and byte inversion, so that sequence numbers are ordered in decreasing order.

Let’s explain a bit more thoroughly the second point: why does using big-endianness with byte inversion allow numbers to be ordered in decreasing order?

Integers are usually stored with multiple bytes (usually 4 or 8 bytes depending on the range that the integers span – even if that can be fewer if numbers are small). Endianness corresponds to the order in which bytes are written. With big-endianness, the most significant byte is stored first and the byte-level representation thus aligns with their numerical order. For example, the big-endian forms for 259 (0x0103), 520 (0x0208) and 1024 (0x0400) are 01 03, 02 08 and 04 00 respectively. If they are compared byte by byte, they would correctly be ordered (259 < 520 < 1024) since 01 is smaller than 02 which is smaller than 04. However, with little-endianness, their forms would have been 03 01, 08 02 and 00 04 and thus, they would have been incorrectly ordered when compared byte by byte (1024 < 259 < 520 since 00 < 03 < 08).

So, the integers are sorted in their numerical order with big-endianness, but they need to be in the reverse order. The order can be reversed by applying the bitwise NOT inversion to each byte of the sequence number. In other words, each byte becomes its complementary to 255 (0xFF). In practice, this is done by masking the inverted byte with the 0xFF mask (bytes(~b & 0xFF for b in seq_num_bytes)). Following with the previous example, 259 (0x0103), 520 (0x0208) and 1024 (0x0400) would be encoded respectively as FE FC, FD F7 and FB FF and would thus be ordered in reverse order (1024 < 520 < 259) since FB < FD < FE. Yay! 🎉

Implementing the snapshot key with this idea can then be done in only 3 lines:

import struct


class Record:
    def __init__(self):
        self.key = ...
        self.sequence_number = ...
        ...

    def snapshot_key(self):
        # Encode the sequence number in big-endian format
        seq_num_bytes = struct.pack('>Q', self.sequence_number)

        # Invert the bytes of the sequence number
        inverted_seq_num_bytes = bytes(~b & 0xFF for b in seq_num_bytes)

        return self.key + inverted_seq_num_bytes

I now have snapshot keys that are ordered meaningfully when encoded. This will be super useful: even if the red-black tree is completely agnostic of the content of its nodes’ keys, it will be able to display values in the expected order!