RC: W11 D1 — Ordering sequence numbers with big-endianness and bitwise NOT
April 22, 2024The sequence number associated to each record allows to stamp a particular version of the record. Indeed, if two distinct values for the same key have been inserted in the database, the one inserted first will have a sequence number smaller than that of the one inserted second (since sequence numbers are monotonically increasing). If all the different versions are stored in the database (either in the memtables or SSTables), it becomes possible to access historical records by specifying a particular sequence number.
In my implementation, the memtable is a wrapper around a red-black tree. Therefore, if I want to access the versions of records at a particular point in time, I need to store both the record key and its sequence number as the key of each red-black tree node in order to use both pieces of information for lookups. Originally I was thinking about updating my red-black tree to store a tuple (key, sequence number) as each node’s key. But Evance whom I was pairing with advised me to convert node keys as bytes so that they would become agnostic of their content. With this, the memtable could be responsible for decoding the node’s key and value to select the correct versions of the record.
We thus did a very big refactoring to have the red-black tree handle keys as bytes instead of strings and integers (a good 50% of the 250 unit tests were broken because of this, but having them was of huge help!).
Once that was done, I needed to encode the sequence number together with the key. For the rest of this post, I will call an encoded key + sequence number a “snapshot key”. However, there is one important constraint to take into account: the ordering of these snapshot keys. More specifically:
- Distinct keys should be ordered lexicographically (as before);
- Two different versions of a record for a given key should be ordered so that the most recent version is ordered first.
For example, the following three records should be ordered as such: \((keyA, 10) < (keyA, 3) < (keyB, 8)\)
To comply with these requirements, I decided to :
- Concatenate the encoded key with the encoded sequence number, so that the key is the main ordering factor;
- Encode the sequence number using big-endianness and byte inversion, so that sequence numbers are ordered in decreasing order.
Let’s explain a bit more thoroughly the second point: why does using big-endianness with byte inversion allow numbers to be ordered in decreasing order?
Integers are usually stored with multiple bytes (usually 4 or 8 bytes depending on the range that the integers span –
even if that can be fewer if numbers are small).
Endianness corresponds to the order in which bytes are written.
With big-endianness, the most significant byte is stored first and the byte-level representation thus aligns with their
numerical order.
For example, the big-endian forms for 259 (0x0103)
, 520 (0x0208)
and 1024 (0x0400)
are 01 03
, 02 08
and 04 00
respectively.
If they are compared byte by byte, they would correctly be ordered (259 < 520 < 1024
) since 01
is smaller than 02
which is smaller than 04
.
However, with little-endianness, their forms would have been 03 01
, 08 02
and 00 04
and thus, they would have been
incorrectly ordered when compared byte by byte (1024 < 259 < 520
since 00 < 03 < 08
).
So, the integers are sorted in their numerical order with big-endianness, but they need to be in the reverse order.
The order can be reversed by applying the bitwise NOT inversion to each byte of the sequence number.
In other words, each byte becomes its complementary to 255 (0xFF)
.
In practice, this is done by masking the inverted byte with the 0xFF
mask (bytes(~b & 0xFF for b in seq_num_bytes)
).
Following with the previous example, 259 (0x0103)
, 520 (0x0208)
and 1024 (0x0400)
would be encoded respectively as
FE FC
, FD F7
and FB FF
and would thus be ordered in reverse order (1024 < 520 < 259
) since FB < FD < FE
.
Yay! 🎉
Implementing the snapshot key with this idea can then be done in only 3 lines:
import struct
class Record:
def __init__(self):
self.key = ...
self.sequence_number = ...
...
def snapshot_key(self):
# Encode the sequence number in big-endian format
seq_num_bytes = struct.pack('>Q', self.sequence_number)
# Invert the bytes of the sequence number
inverted_seq_num_bytes = bytes(~b & 0xFF for b in seq_num_bytes)
return self.key + inverted_seq_num_bytes
I now have snapshot keys that are ordered meaningfully when encoded. This will be super useful: even if the red-black tree is completely agnostic of the content of its nodes’ keys, it will be able to display values in the expected order!