RC: W5 D3 — Merging iterators (part 2)

March 13, 2024

My memtables rely on red-black trees, and I have implemented the latter such that, if I insert a record whose key is a duplicate of another record in the tree, I replace it with the new value. Therefore, I am guaranteed to never have any duplicates within the same memtable (here, by “duplicate”, I mean “two records that have the same key”).

However, the records in one memtable may be duplicates of others located in a different memtable. Thus, when merging iterators from multiple memtables, I need to make sure I return only the correct duplicate (i.e. the most recently-inserted record). In order to separate concerns, I created a second function whose responsibility would be to filter out the duplicates. It can be implemented very simply: the first time it encounters a new key, it yields the record, and every duplicate that shows up afterward is ignored.

def filter_duplicates(iterator: Iterator[Record]) -> Iterator[Record]:
    previous_item = None
    for item in iterator:
        if previous_item is not None and item.is_duplicate(previous_item):
            continue
        yield item
        previous_item = item

The filter_duplicates function takes the merged iterator (containing duplicates) as input and returns an iterator without duplicates.

To make sure this simple implementation is correct, I need to make sure that the first duplicate in the list is the most recently-written one. In other words, I need to make sure that the merge_iterators function implemented yesterday would yield them in this order. Given that my merge_iterators function uses a heapq and that iterators are passed in most to least recent order to it, I made a very simple experiment to check how duplicates are inserted:

import heapq


class Item:
    def __init__(self, key, id):
        self.key = key
        self.id = id

    def __lt__(self, other: "Item"):
        return self.key < other.key


heap = []
heapq.heappush(heap, Item("a", 1))
heapq.heappush(heap, Item("a", 3))
heapq.heappush(heap, Item("a", 2))

print([el.id for el in heap])  # [1, 3, 2]

Duplicates are indeed inserted after the first occurrence! This makes total sense since Python’s heapq is implemented as a binary tree according to their documentation.

So everything is fine: I can now make sure I merge iterators and keep only the relevant record when there are duplicates!