On Tue, Apr 03, 2018 at 03:37:30PM +0100, Greg Stark wrote:
> On 3 April 2018 at 14:36, Anthony Iliopoulos <ail...@altatus.com> wrote:
>
> > If EIO persists between invocations until explicitly cleared, a process
> > cannot possibly make any decision as to if it should clear the error
>
> I still don't understand what "clear the error" means here. The writes
> still haven't been written out. We don't care about tracking errors,
> we just care whether all the writes to the file have been flushed to
> disk. By "clear the error" you mean throw away the dirty pages and
> revert part of the file to some old data? Why would anyone ever want
> that?
It means that the responsibility for recovering the data is passed back
to the application. The writes may never make it to disk at all; how
would a kernel deal with that? Either discard the data (and have the
writer acknowledge the loss), or buffer the data until reboot and simply
risk going OOM. It's not something anyone would *want*, but rather
something they *need* to deal with, one way or the other. At least at
the application level there's a fighting chance of restoring a
consistent state. The kernel does not have that opportunity.

> > But instead of deconstructing and debating the semantics of the
> > current mechanism, why not come up with the ideal desired form of
> > error reporting/tracking granularity etc., and see how this may be
> > fitted into kernels as a new interface.
>
> Because Postgres is portable software that won't be able to use some
> Linux-specific interface. And doesn't really need any granular error

I don't really follow this argument: Pg admittedly already uses
non-portable interfaces (e.g. sync_file_range(); see the snippet at the
end of this mail). While it's nice to avoid platform-specific hacks,
expecting POSIX semantics to be consistent across systems is simply a
90's pipe dream. However lovely it would be to have truly consistent
interfaces for application writers, that is not going to happen any
time soon.

And since the problematic fsync() semantics appear to be prevalent in
other systems as well, and are not likely to change, you cannot rely on
the preconception that once buffers are handed over to the kernel you
have a guarantee that they will eventually be persisted no matter what.
(Why even bother having fsync() in that case? The kernel would
eventually evict and write back dirty pages anyway. The point of
reporting the error back to the application is to give it a chance to
recover - the kernel could repeat "fsync()" itself internally, if that
would solve anything. A sketch of what application-level recovery could
look like follows at the end of this mail.)

> reporting system anyways. It just needs to know when all writes have
> been synced to disk.

Well, it does know when *some* writes have *not* been synced to disk,
exactly because the responsibility is passed back to the application.
I do realize this puts more of a burden back on the application, but
what would a viable alternative be? Would you rather have a kernel that
risks periodically going OOM due to this design decision?

Best regards,
Anthony
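
P.S.: To make the recovery point concrete, below is a minimal sketch in
C, assuming the Linux behavior discussed in this thread (a failed
fsync() reports the error once and leaves the affected pages marked
clean). The rewrite_from_app_buffer() helper is hypothetical - it
stands in for whatever durable copy the application kept, e.g. its WAL.
This is not how Pg does it today, just an illustration of the
semantics:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: re-issue the writes from a copy the
 * application still holds (e.g. replay from its WAL). */
extern int rewrite_from_app_buffer(int fd);

int
flush_or_recover(int fd)
{
    if (fsync(fd) == 0)
        return 0;           /* all writes reached stable storage */

    if (errno != EIO)
        return -1;          /* unrelated failure; handle elsewhere */

    /*
     * NOT safe: calling fsync(fd) again. The kernel reported the
     * error once and the pages are clean, so a retry can return
     * success for data that never reached disk.
     *
     * Safer: re-issue the writes from data the application still
     * holds, then fsync() again. If no such copy exists, the only
     * honest option left is to crash and run recovery on restart.
     */
    if (rewrite_from_app_buffer(fd) != 0)
    {
        fprintf(stderr, "writes lost and no copy to recover from\n");
        abort();            /* crash recovery is the last resort */
    }

    return (fsync(fd) == 0) ? 0 : -1;
}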
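
And the Linux-only interface mentioned above, for reference - a sketch
of initiating writeback for a file range without waiting for it to
complete, roughly the kind of hint Pg issues ahead of the eventual
fsync(). The function name hint_writeback() is mine; note that this
call provides no durability guarantee by itself:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    /* SYNC_FILE_RANGE_WRITE: start writeback of dirty pages in the
     * given range, but do not wait for it to complete. */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
    {
        perror("sync_file_range");
        return -1;
    }
    return 0;
}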