RC: W1 D5 — Flush buffer to read!

February 16, 2024

I stumbled upon an interesting bug today that helped me better understand two concepts that had always been blurry to me: buffers and streams of data.

I was working with Python and was trying to read the content of a file right after writing into it (and before closing it). Here is the (buggy) code I had:

FILE_PATH = "./my-file.txt"

# Write something to a file
file_being_written = open(FILE_PATH, "w+")
file_being_written.write("something")

# Create a second file object reading the same file
file_being_read = open(FILE_PATH, "r")
lines = file_being_read.readlines()
print(lines)  # Prints []
# We can't read the file's content !!!

# Call `.tell()` on the file being written
file_being_written.tell()
lines = file_being_read.readlines()
print(lines)  # Prints ['something']
# But now, we _can_ read the file's content !!! 🤯

So, when I open a second file object to read the file, I can’t read its content, but if I call file_being_written.tell() (a method that returns the current position of the file pointer), then I can read it… WAT ??!!??

To understand what happened, I needed to better understand what buffers do. Here it is: each Python file object has its own internal buffer. When we call the write() method, the data is actually written to that internal buffer, not to disk. As a consequence, when I create a second file object reading from the same file (file_being_read), the data is still sitting in the buffer of file_being_written and has not been flushed to disk yet. Thus, the new file object cannot see it and reads an empty file.

However, when I call the tell() method on the first file object (file_being_written), there is a side effect: for a file opened in text mode, as here, it flushes the write buffer. Therefore, the second file object can now access and read the file content!
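Here is a small sketch to observe that side effect from the outside, by checking the file size on disk before and after tell(). It assumes CPython and a file opened in text mode; the helper name sizes_around_tell and the temporary file are only there for illustration.

```python
import os
import tempfile


def sizes_around_tell(text="something"):
    """Return the file size seen by the OS after write(), then after tell()."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "my-file.txt")
        with open(path, "w+") as f:
            f.write(text)
            # The text is small, so it is still held in Python's buffer
            size_after_write = os.path.getsize(path)
            # In text mode, tell() flushes the pending write as a side effect
            f.tell()
            size_after_tell = os.path.getsize(path)
    return size_after_write, size_after_tell


print(sizes_around_tell())
```

On my understanding of CPython, the first size is 0 and the second is the length of the text in bytes. I would not rely on this in real code, since flushing is a side effect of tell() rather than its purpose.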

To circumvent the problem, we just need to force the flushing by calling file_being_written.flush() before trying to read the file content. Problem solved!! 🎉
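For reference, here is a minimal sketch of the fixed version. It uses a temporary directory instead of my original FILE_PATH so that it can run anywhere, and the function name write_then_read is just an illustration.

```python
import os
import tempfile


def write_then_read(text="something"):
    """Write text, flush it, then read it back through a second file object."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        file_path = os.path.join(tmp_dir, "my-file.txt")
        with open(file_path, "w+") as file_being_written:
            file_being_written.write(text)
            # Push the data out of Python's buffer before anyone else reads
            file_being_written.flush()
            with open(file_path, "r") as file_being_read:
                return file_being_read.readlines()


print(write_then_read())  # Prints ['something']
```

Closing the file (or leaving its `with` block) also flushes the buffer, so an explicit flush() is only needed when the file has to stay open while something else reads it.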

Bonus: my explanation above is a bit simplified, because there is not a single level of buffering but two:

The documentation for the flush() method mentions that this method does not necessarily flush the data to disk. Instead, data may only be flushed from Python’s internal write buffer to the OS’s buffer. This is enough if another file object or process is trying to access the file content, but not necessarily enough to persist the data on disk (if, say, the computer turns off, the data may not have been written to disk yet). If we want to ensure that the data is actually written to disk whatever happens, we need to call flush() first and then os.fsync(), which takes a file descriptor (for example os.fsync(file_being_written.fileno())), to flush from the OS’s buffer to disk.
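The two levels can be sketched as follows. This is a minimal example, not a complete durability recipe: the function name write_durably and the file name are made up for illustration, and how much fsync really guarantees depends on the OS and the hardware.

```python
import os
import tempfile


def write_durably(path, text):
    """Write text and ask the OS to push it all the way to disk."""
    with open(path, "w") as f:
        f.write(text)
        # Level 1: Python's internal buffer -> OS buffer
        f.flush()
        # Level 2: OS buffer -> disk (fsync needs a file descriptor)
        os.fsync(f.fileno())


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "my-file.txt")
    write_durably(demo_path, "something")
    with open(demo_path) as f:
        content = f.read()

print(content)  # Prints something
```

Calling fsync on every write can be slow, so it seems best kept for data that really must survive a crash.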