Hello,

We corrected our previously held belief that it's safe to retry fsync(). We still haven't dealt with the possibility that some kernels can forget about write-back errors due to inode cache pressure, if there is a time when no process has the file open. To prevent that, we need to hold dirty files open, somehow, until fsync() is called.
The leading idea so far is a scheme to keep the fd that performed the oldest write around by passing it through a Unix domain socket to the checkpointer. Among the problems with that are:

(1) there's an as-yet unresolved deadlock problem, but let's call that just a SMoP,
(2) it relies on atomic Unix socket messages without any guarantee from the OS,
(3) it relies on being able to keep fds in the Unix domain socket (not counted by PostgreSQL), admittedly with the socket buffer size as back pressure,
(4) it relies on the somewhat obscure and badly documented fd-passing support working the way we think it does, without bugs, on every OS,
(5) it's too complicated to back-patch.

In this thread I'd like to discuss a simpler alternative.

A more obvious approach, and probably one closer to the way kernel developers expect us to write programs, is to call fsync() before close() (when closing due to vfd pressure) if you've written to the file. The obvious problem with that is that you could finish up doing far more fsyncing than we do today if you regularly dirty more than max_files_per_process files in the same backend. I wonder if we could make it more acceptable like so:

* when performing writes, record the checkpointer's cycle counter in the File

* when closing due to vfd pressure, only call fsync() if the cycle hasn't advanced (that is, you get to skip the fsync() if you haven't written in this sync cycle, because the checkpointer must have taken care of your writes from the previous cycle)

* if you find you're doing this too much (by default, dirtying more than 1k relation segments per checkpoint cycle), maybe log a warning that the user might want to consider increasing max_files_per_process (and the related OS limits)

A possible improvement, stolen from the fd-passing patch, is to introduce a "sync cycle", separate from the checkpoint cycle, so that the checkpointer can process its table of pending operations more than once per checkpoint if that would help minimise fsyncing in foreground processes. I haven't thought much about how exactly that would work.

There is a less obvious problem with this scheme, though:

1. Suppose the operating system has a single error flag per file, no matter how many fds have it open, and it is cleared by the first syscall to return EIO. Then there is a broken schedule like this (B = regular backend, C = checkpointer):

   B: fsync() -> EIO            # clears error flag
   C: fsync() -> success, log checkpoint
   B: PANIC!

2. On an operating system that has an error counter + seen flag (Linux 4.14+), the error is in general reported on all fds, which is a very nice property, but we'd still have a broken schedule involving the seen flag:

   B: fsync() -> EIO            # clears seen flag
   C: open()                    # opened after the error
   C: fsync() -> success, log checkpoint
   B: PANIC!

Here's one kind of interlocking that might work: hash pathnames and map them to an array of lwlocks + error flags. Any process trying to sync a file must hold the lock and check for a pre-existing error flag. Now a checkpoint cannot succeed if any backend has recently decided to panic. You could skip that if data_sync_retry = on.
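To make that interlock concrete, here's a minimal sketch of what I have in mind (the names SyncErrorSlot, sync_error_slots, SyncErrorSlotFor and LockedFsync, the slot count and the hash are all invented for illustration, and shared memory allocation and LWLockInitialize() calls are left out):

/*
 * Sketch only: these names are hypothetical, not existing PostgreSQL
 * APIs.  Shared memory setup and lock initialisation are elided.
 */
#include "postgres.h"

#include <errno.h>
#include <unistd.h>

#include "storage/lwlock.h"

#define SYNC_ERROR_SLOTS 128

typedef struct SyncErrorSlot
{
	LWLock		lock;			/* protects error_seen */
	bool		error_seen;		/* sticky: a sync of some file mapped here failed */
} SyncErrorSlot;

/* Array of SYNC_ERROR_SLOTS slots, allocated in shared memory at startup. */
static SyncErrorSlot *sync_error_slots;

/* Map a pathname to a slot with a simple FNV-1a-style hash. */
static SyncErrorSlot *
SyncErrorSlotFor(const char *path)
{
	uint32		hash = 2166136261u;

	for (; *path != '\0'; path++)
		hash = (hash ^ (unsigned char) *path) * 16777619u;

	return &sync_error_slots[hash % SYNC_ERROR_SLOTS];
}

/*
 * fsync() a file while holding the slot lock for its pathname.  If any
 * process has already observed a sync failure for this slot, report EIO
 * even if the kernel would now report success, so the checkpointer can't
 * log a checkpoint covering writes that a backend has decided to PANIC
 * over.  (With data_sync_retry = on, the error_seen check could be
 * skipped.)
 */
static int
LockedFsync(int fd, const char *path)
{
	SyncErrorSlot *slot = SyncErrorSlotFor(path);
	int			rc = 0;

	LWLockAcquire(&slot->lock, LW_EXCLUSIVE);
	if (slot->error_seen)
	{
		errno = EIO;
		rc = -1;
	}
	else if (fsync(fd) != 0)
	{
		slot->error_seen = true;
		rc = -1;
	}
	LWLockRelease(&slot->lock);

	return rc;
}

The flag is deliberately sticky: once any process has seen EIO for a slot, a later fsync() that happens to succeed can't make a checkpoint look good again.

--
Thomas Munro
https://enterprisedb.com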