RC: W5 D3 — Merging iterators (part 2)
March 13, 2024My memtables rely on red-black trees, and I have implemented the latter such that, if I insert a record whose key is a duplicate of another record in the tree, I replace it with the new value. Therefore, I am guaranteed to never have any duplicates within the same memtable (here, by “duplicate”, I mean “two records that have the same key”).
However, the records in one memtable may be duplicates of others located in a different memtable. Thus, when merging iterators from multiple memtables, I need to make sure I return only the correct duplicate (i.e. the most recently-inserted record). In order to separate concerns, I created a second function whose responsibility would be to filter out the duplicates. It can be implemented very simply: the first time it encounters a new key, it yields the record, and every duplicate that shows up afterward is ignored.
def filter_duplicates(iterator: Iterator[Record]) -> Iterator[Record]:
previous_item = None
for item in iterator:
if previous_item is not None and item.is_duplicate(previous_item):
continue
yield item
previous_item = item
The filter_duplicates
function takes the merged iterator (containing duplicates) as input and returns an iterator
without duplicates.
To make sure this simple implementation is correct, I need to make sure that the first duplicate in the list is the most
recently-written one.
In other words, I need to make sure that the merge_iterators
function implemented yesterday would yield them in this
order.
Given that my merge_iterators
function uses a heapq
and that
iterators are passed in most to least recent order to it, I made a very simple experiment to check how duplicates are
inserted:
import heapq
class Item:
def __init__(self, key, id):
self.key = key
self.id = id
def __lt__(self, other: "Item"):
return self.key < other.key
heap = []
heapq.heappush(heap, Item("a", 1))
heapq.heappush(heap, Item("a", 3))
heapq.heappush(heap, Item("a", 2))
print([el.id for el in heap]) # [1, 3, 2]
Duplicates are indeed inserted after the first occurrence! This makes total sense since
Python’s heapq
is implemented as a binary tree according
to their documentation.
So everything is fine: I can now make sure I merge iterators and keep only the relevant record when there are duplicates!