So, what can we actually do about this new Linux behaviour?

Idea 1:

* whenever you open a file, either tell the checkpointer so it can
open it too (and wait for it to tell you that it has done so, because
it's not safe to write() until then), or send it a copy of the file
descriptor via IPC (since duplicated file descriptors share the same
f_wb_err)

* if the checkpointer can't take any more file descriptors (how would
that limit even work in the IPC case?), then it somehow needs to tell
you that so that you know that you're responsible for fsyncing that
file yourself, both on close (due to fd cache recycling) and also when
the checkpointer tells you to

Maybe it could be made to work, but sheesh, that seems horrible.  Is
there some simpler idea along these lines that could make sure that
fsync() is only ever called on file descriptors that were opened
before all unflushed writes, or file descriptors cloned from such file
descriptors?
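
For what it's worth, the "send it a copy of the file descriptor via IPC" half of Idea 1 could look roughly like the sketch below.  It assumes a connected AF_UNIX socket between backend and checkpointer and uses SCM_RIGHTS ancillary data; send_fd() and recv_fd() are made-up names for illustration, and none of the limit handling or "you're on your own" signalling from the second bullet is shown.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_control
{
    struct cmsghdr align;
    char        buf[CMSG_SPACE(sizeof(int))];
};

/*
 * Hypothetical helper: pass one open descriptor over a connected AF_UNIX
 * socket.  Returns 0 on success, -1 on failure.
 */
int
send_fd(int sock, int fd)
{
    char        byte = 'F';     /* one byte of ordinary data goes along */
    struct iovec iov = {.iov_base = &byte, .iov_len = 1};
    union fd_control u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: receive one descriptor.  Returns the new descriptor
 * number in this process, or -1 on failure.
 */
int
recv_fd(int sock)
{
    char        byte;
    struct iovec iov = {.iov_base = &byte, .iov_len = 1};
    union fd_control u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int         fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL ||
        cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The descriptor that comes out of recv_fd() has a different number but refers to the same open file as the sender's, which is the property Idea 1 relies on.  The backend would still have to wait for an acknowledgement before its first write(), and that part is not sketched here.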

Idea 2:

Give up, complain that this implementation is defective and
unworkable, both on POSIX-compliance grounds and on POLA grounds, and
campaign to get it fixed more fundamentally (actual details left to
the experts, no point in speculating here, but we've seen a few
approaches that work on other operating systems including keeping
buffers dirty and marking the whole filesystem broken/read-only).

Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.
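
To give a feel for what that would demand of callers, here is a minimal Linux-flavoured sketch.  The 4096-byte alignment is an assumption (the real requirement depends on the device and filesystem), and open_direct_sync(), alloc_io_buffer(), direct_io_aligned() and write_block() are hypothetical names, not existing PostgreSQL functions.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* ASSUMPTION: alignment required for buffer address, offset and length. */
#define IO_ALIGN 4096

/* Open a file so that writes bypass the page cache and are synchronous. */
int
open_direct_sync(const char *path)
{
    return open(path, O_RDWR | O_CREAT | O_SYNC | O_DIRECT, 0600);
}

/* Allocate a buffer suitable for direct I/O; len must be a multiple of IO_ALIGN. */
void *
alloc_io_buffer(size_t len)
{
    void       *buf = NULL;

    assert(len % IO_ALIGN == 0);
    if (posix_memalign(&buf, IO_ALIGN, len) != 0)
        return NULL;
    return buf;
}

/* Check the three alignment conditions direct I/O is assumed to need. */
bool
direct_io_aligned(const void *buf, off_t offset, size_t len)
{
    return (uintptr_t) buf % IO_ALIGN == 0 &&
        offset % IO_ALIGN == 0 &&
        len % IO_ALIGN == 0;
}

/*
 * Write one aligned block.  Returns 0 on success, -1 on misalignment or on
 * a failed or short write.  The intent is that an I/O error shows up here,
 * in the process that issued the write, rather than at some later fsync().
 */
int
write_block(int fd, const void *buf, size_t len, off_t offset)
{
    if (!direct_io_aligned(buf, offset, len))
        return -1;
    return pwrite(fd, buf, len, offset) == (ssize_t) len ? 0 : -1;
}
```

The hard part of Idea 3 is everything this leaves out: without the kernel's page cache we would have to do our own write-behind and read-ahead.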

Any other ideas?

For a while I considered suggesting an idea which I now think doesn't
work.  I thought we could try asking for a new fcntl interface that
spits out the wb_err counter.  Call it an opaque error token or something.
Then we could store it in our fsync queue and safely close the file.
Check again before fsync()ing, and if we ever see a different value,
PANIC because it means a writeback error happened while we weren't
looking.  Sadly I think it doesn't work because AIUI inodes are not
pinned in kernel memory when no one has the file open and there are no
dirty buffers, so I think the counters could go away and be reset.
Perhaps you could keep inodes pinned by keeping the associated buffers
dirty after an error (like FreeBSD), but if you did that you'd have
solved the problem already and wouldn't really need the wb_err system
at all.  Is there some other idea along these lines that could work?
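
To make the error-token idea and its failure mode concrete, here is a toy model.  Everything in it is hypothetical: no such fcntl exists, struct fake_inode merely stands in for kernel state, and the simulate_* functions mimic what I believe the kernel could do rather than calling anything real.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL: the opaque error token a new fcntl might hand back. */
typedef uint32_t wb_token;

/* Toy stand-in for the kernel's per-inode writeback error counter. */
struct fake_inode
{
    wb_token    wb_err;
};

/* What we would keep in the fsync queue after closing the file. */
struct fsync_request
{
    const struct fake_inode *inode;
    wb_token    token_at_close;
};

/* HYPOTHETICAL replacement for the imagined fcntl call. */
wb_token
get_error_token(const struct fake_inode *inode)
{
    return inode->wb_err;
}

/* Record the token just before close(). */
void
remember_before_close(struct fsync_request *req, const struct fake_inode *inode)
{
    assert(req != NULL && inode != NULL);
    req->inode = inode;
    req->token_at_close = get_error_token(inode);
}

/* After reopening: true means "go ahead and fsync", false means PANIC. */
bool
token_unchanged(const struct fsync_request *req)
{
    return get_error_token(req->inode) == req->token_at_close;
}

/* A writeback error bumps the counter. */
void
simulate_writeback_error(struct fake_inode *inode)
{
    inode->wb_err++;
}

/* Inode evicted from memory and read back in: the counter starts over. */
void
simulate_inode_eviction(struct fake_inode *inode)
{
    inode->wb_err = 0;
}
```

In this model, an error followed by an eviction leaves the token looking exactly as it did at close time, so the check passes and the lost write goes unnoticed.  That is the hole described above.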

-- 
Thomas Munro
http://www.enterprisedb.com
