So, what can we actually do about this new Linux behaviour?

Idea 1:

* whenever you open a file, either tell the checkpointer so it can open it too (and wait for it to tell you that it has done so, because it's not safe to write() until then), or send it a copy of the file descriptor via IPC (since duplicated file descriptors share the same f_wb_err)

* if the checkpointer can't take any more file descriptors (how would that limit even work in the IPC case?), then it somehow needs to tell you that so that you know that you're responsible for fsyncing that file yourself, both on close (due to fd cache recycling) and also when the checkpointer tells you to

Maybe it could be made to work, but sheesh, that seems horrible. Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?

Idea 2:

Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).

Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

Any other ideas?

For a while I considered suggesting an idea which I now think doesn't work. I thought we could try asking for a new fcntl interface that spits out the wb_err counter. Call it an opaque error token or something. Then we could store it in our fsync queue and safely close the file. Check again before fsync()ing, and if we ever see a different value, PANIC because it means a writeback error happened while we weren't looking. Sadly I think it doesn't work because AIUI inodes are not pinned in kernel memory when no one has the file open and there are no dirty buffers, so I think the counters could go away and be reset.
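
To make the token idea concrete, here is a rough sketch of the bookkeeping in C. To be clear, no such fcntl command exists: get_error_token(), simulated_wb_err and the simulate_* helpers are made-up stand-ins that fake the kernel side in-process, and PendingFsync is only an illustrative name for an fsync queue entry. The eviction helper models the AIUI assumption above, that the counter starts over once the inode is dropped, which may not match what the kernel really does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * HYPOTHETICAL: stands in for the kernel's per-inode writeback error
 * counter.  No fcntl command exposing it exists; this variable and the
 * helpers below only simulate what such an interface might return.
 */
static uint32_t simulated_wb_err = 0;

/* Hypothetical "give me the opaque error token for this file" call. */
static uint32_t
get_error_token(void)
{
    return simulated_wb_err;
}

/* Simulate the kernel recording an asynchronous writeback failure. */
static void
simulate_writeback_error(void)
{
    simulated_wb_err++;
}

/*
 * Simulate the inode being evicted and later reloaded.  Assumption: the
 * counter is not preserved and starts again from zero.
 */
static void
simulate_inode_eviction(void)
{
    simulated_wb_err = 0;
}

/* One fsync queue entry: the token observed just before close(). */
typedef struct PendingFsync
{
    uint32_t token_at_close;
} PendingFsync;

/* Record the current token so the file descriptor can be closed. */
static PendingFsync
remember_before_close(void)
{
    PendingFsync p = { get_error_token() };
    return p;
}

/*
 * Check made just before fsync()ing: true means the token still matches,
 * false means an error happened while we weren't looking (the PANIC case).
 */
static bool
token_unchanged(const PendingFsync *p)
{
    return get_error_token() == p->token_at_close;
}
```

With this, remember_before_close() followed by simulate_writeback_error() makes token_unchanged() return false, which is the PANIC case. But a writeback error followed by simulate_inode_eviction() can leave the token looking unchanged, which is the hole described above.
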
Perhaps you could keep inodes pinned by keeping the associated buffers dirty after an error (like FreeBSD), but if you did that you'd have solved the problem already and wouldn't really need the wb_err system at all. Is there some other idea along these lines that could work?

--
Thomas Munro
http://www.enterprisedb.com