RC: W9 D3 — Serializing JSON records compactly with ProtoBuf
April 10, 2024

Yesterday, I serialized my flush and compaction events by writing a custom schema to encode and decode them. Since encoded events are of variable length, I needed to prepend every record with its size in bytes so that I could decode the correct chunk afterwards. This method works well, but it was tedious to design and implement. So I wondered if I could have done it in another way that would have been faster to implement.
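To recap the idea from yesterday, the length-prefix trick boils down to something like this (a minimal sketch of the principle only, not the actual encoding I wrote; the 4-byte big-endian prefix and the helper names are just for illustration):

import struct

def append_record(file, payload: bytes):
    # Prefix the record with its size encoded as a 4-byte unsigned integer
    file.write(struct.pack(">I", len(payload)) + payload)

def read_records(file):
    # Read a 4-byte size, then exactly that many bytes, until end of file
    while (header := file.read(4)):
        (size,) = struct.unpack(">I", header)
        yield file.read(size)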
An alternative consists of formatting events into JSON records separated by newline characters. This has the advantage of being extremely easy to implement, both for encoding and decoding: I would only need to format my data as a JSON object and then use the json library to encode and decode it, separating each event with a newline character for instance. So instead of spending hours thinking of the correct encoding schema and implementing it, it could all have been done in 30 minutes with this method.
It would be something as simple as:
import json

# Define events
flush_event = {
    "sstable": "sstable2.sst",
}
compaction_event = {
    "input_sstables": ["sstable1.sst", "sstable2.sst"],
    "output_sstables": ["sstable3.sst"],
    "level": 1,
}
events = [flush_event, compaction_event]

# Write events to the manifest, one JSON record per line
with open('manifest.txt', 'wb') as file:
    for event in events:
        json_record = json.dumps(event)  # Convert to JSON string
        json_bytes = json_record.encode('utf-8')  # Convert to bytes
        file.write(json_bytes + b'\n')

# Read events back from the manifest
with open('manifest.txt', 'rb') as file:
    events = [json.loads(line.decode('utf-8')) for line in file]
However, this method would use a lot of disk space since it encodes field names as well as values.
To avoid excessive disk usage, instead of JSON I could use a more compact serialization format like Protocol Buffers (nicknamed “ProtoBuf”) or Avro. For both of these tools, the idea behind making serialization more compact is the same: instead of writing all the field names in every record, the schema is defined once and only the values are written in the records. Then, when deserializing, the schema is read to know which bytes correspond to which fields, and the data can be decoded.
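For illustration, a ProtoBuf schema for the compaction event could look something like the following (a rough sketch; the message name and field numbers are choices I am making up here, not something I have implemented):

syntax = "proto3";

message CompactionEvent {
  repeated string input_sstables = 1;
  repeated string output_sstables = 2;
  int32 level = 3;
}

Each serialized record then only carries small field tags and the values, instead of repeating the field names the way JSON does.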
As far as I understand, the problem with these two tools is that they serialize a whole file at once, so I could not use them to write events one by one while the operations are being done on the engine. However, I discovered FlatBuffers and Cap’n Proto, which seem to suit exactly this need to append records one by one when triggered. Both of these tools have an associated Python library: flatbuffers and pycapnp respectively.
From what I have read, the flow is the same for both tools:
- Install the tool and the Python dependency;
- Define a schema;
- Compile the schema so that the definitions are translated into the target language (Python in this case);
- Use the schema in a Python script to serialize to a file.
For example, with FlatBuffers, I could define a simplified event with only category (flush or compact) and SSTable as follows:
namespace Manifest_Schema;

table Event {
    category: string;
    sstable: string;
}
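Assuming the schema is saved in a file called manifest_schema.fbs (a name I am making up here), it would be compiled to Python with the flatc compiler, with something like:

flatc --python manifest_schema.fbs

This should generate a Manifest_Schema package containing an Event module with the builder helpers used below.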
After compiling it, there are a number of automatically generated methods that can be used to build the event (EventAddCategory and EventAddSstable in this case).
With those, it is easy to write a small script that serializes the event:
import Manifest_Schema.Event as Event
import flatbuffers

def write_event_to_file(flush_event, filename):
    # Initialize the buffer
    builder = flatbuffers.Builder(1024)

    # Serialize the strings first
    category = builder.CreateString("FLUSH")
    sstable = builder.CreateString(flush_event["sstable"])

    # Build the Event object
    Event.EventStart(builder)
    Event.EventAddCategory(builder, category)
    Event.EventAddSstable(builder, sstable)
    event = Event.EventEnd(builder)
    builder.Finish(event)

    # Append the serialized bytes to the file
    with open(filename, 'ab') as f:
        f.write(builder.Output())
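And as a quick sanity check, here is how I imagine calling it and reading a record back (a sketch only: it relies on the classic generated Python API and assumes, for simplicity, that the manifest contains a single event):

# Append one flush event to the manifest
write_event_to_file({"sstable": "sstable2.sst"}, "manifest.bin")

# Read it back and access the fields through the generated accessors
with open("manifest.bin", "rb") as f:
    buf = f.read()
event = Event.Event.GetRootAsEvent(buf, 0)
print(event.Category().decode(), event.Sstable().decode())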
So this is fairly easy to do with an external library, and next time it would definitely be a valid option to explore!