Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
> >> Yeah, I see why you want to PANIC.
> >
> > Indeed. Even doing that leaves question marks about all the kernel
> > versions before v4.13, which at this point is pretty much everything
> > out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel
versions from userspace, but it will be messy in any case. On EIO
errors, the kernel will not restore the dirty page flags, but it
will flip the error flags on the failed pages. One could mmap() the
file in question, obtain the PFNs (via /proc/pid/pagemap), and
enumerate those to match the ones with the error flag switched on
(via /proc/kpageflags). This could serve at least as a detection
mechanism, but one could also use this info to logically map the
pages that failed IO back to the original file offsets, and
potentially retry IO just for those file ranges that cover the
failed pages. Just an idea, not tested.

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:
> On 31 March 2018 at 21:24, Anthony Iliopoulos <[1]ail...@altatus.com> wrote:
> > On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:
> > > > Yeah, I see why you want to PANIC.
> > >
> > > Indeed. Even doing that leaves question marks about all the kernel
> > > versions before v4.13, which at this point is pretty much everything
> > > out there, not even detecting this reliably. This is messy.
> >
> > There may still be a way to reliably detect this on older kernel
> > versions from userspace, but it will be messy in any case. On EIO
> > errors, the kernel will not restore the dirty page flags, but it
> > will flip the error flags on the failed pages. One could mmap()
> > the file in question, obtain the PFNs (via /proc/pid/pagemap)
> > and enumerate those to match the ones with the error flag switched
> > on (via /proc/kpageflags). This could serve at least as a detection
> > mechanism, but one could also further use this info to logically
> > map the pages that failed IO back to the original file offsets,
> > and potentially retry IO just for those file ranges that cover
> > the failed pages. Just an idea, not tested.
>
> That sounds like a huge amount of complexity, with uncertainty as to how
> it'll behave kernel-to-kernel, for negligible benefit.

Those interfaces have been around since kernel 2.6 and are rather
stable, but I was merely responding to your original comment about
having a way of finding out which page(s) failed. I assume that
there would indeed be no benefit, especially since those errors are
usually not transient (typically they come from hard medium faults),
and although a filesystem could theoretically mask the error by
allocating a different logical block, I am not aware of any
implementation that currently does that.
> I was exploring the idea of doing selective recovery of one relfilenode,
> based on the assumption that we know the filenode related to the fd that
> failed to fsync(). We could redo only WAL on that relation. But it fails
> the same test: it's too complex for a niche case that shouldn't happen in
> the first place, so it'll probably have bugs, or grow bugs in bitrot over
> time.

Fully agree; those cases should be sufficiently rare that a complex
and possibly unmaintainable solution is not really warranted.

> Remember, if you're on ext4 with errors=remount-ro, you get shut down even
> harder than a PANIC. So we should just use the big hammer here.

I am not entirely sure what you mean here: does Pg really treat
write() errors as fatal? Also, the kind of errors that ext4 detects
with this option are at the superblock level and govern metadata
rather than actual data writes (recall that those are buffered
anyway; no actual device IO has to take place at the time of
write()).

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> Craig Ringer writes:
> > So we should just use the big hammer here.
>
> And bitch, loudly and publicly, about how broken this kernel behavior is.
> If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with
the manpage patches and errseq_t fixes on the reporting level). The
issue is, the kernel needs to deal with hard IO errors at that level
somehow, and since those errors typically persist, re-dirtying the
pages would not really solve the problem (unless some filesystem
remaps the request to a different block, assuming the device is
alive). Keeping around dirty pages that cannot possibly be written
out is essentially a memory leak, as those pages would stay around
even after the application has exited.

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> Hi,
>
> On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:
> > On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:
> > > Craig Ringer writes:
> > > > So we should just use the big hammer here.
> > >
> > > And bitch, loudly and publicly, about how broken this kernel behavior is.
> > > If we make enough of a stink maybe it'll get fixed.
> >
> > It is not likely to be fixed (beyond what has been done already with the
> > manpage patches and errseq_t fixes on the reporting level). The issue is,
> > the kernel needs to deal with hard IO errors at that level somehow, and
> > since those errors typically persist, re-dirtying the pages would not
> > really solve the problem (unless some filesystem remaps the request to a
> > different block, assuming the device is alive).
>
> Throwing away the dirty pages *and* persisting the error seems a lot
> more reasonable. Then provide a fcntl (or whatever) extension that can
> clear the error status in the few cases where the application wants
> to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are
practically thrown away, the semantics of fsync() (after the 4.13
fixes) are essentially correct: the first call indicates that a
writeback error indeed occurred, while subsequent calls have no
reason to indicate an error (assuming no other errors occurred in
the meantime). The error reporting is thus consistent with the
intended semantics (which are sadly not properly documented).
Repeated calls to fsync() simply do not imply that the kernel will
retry writeback of the previously-failed pages, so the application
needs to be aware of that. Persisting the error at the fsync() level
would essentially mean moving application policy into the kernel.

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:
> On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:
> > On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:
> > > Throwing away the dirty pages *and* persisting the error seems a lot
> > > more reasonable. Then provide a fcntl (or whatever) extension that can
> > > clear the error status in the few cases where the application wants
> > > to gracefully deal with the case.
> >
> > Given precisely that the dirty pages which cannot be written out are
> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > are essentially correct: the first call indicates that a writeback error
> > indeed occurred, while subsequent calls have no reason to indicate an
> > error (assuming no other errors occurred in the meantime).
>
> Meh^2.
>
> "no reason" - except that there's absolutely no way to know what state
> the data is in. And that your application needs explicit handling of
> such failures. And that one FD might be used in lots of different
> parts of the application, where an fsync() failure might be an ok
> outcome in one part and not in another. Requiring explicit action
> to acknowledge "we've thrown away your data for unknown reason" seems
> entirely reasonable.

As long as fsync() indicates an error on the first invocation, the
application is fully aware that data has been lost between that
point in time and the last call to fsync(). Persisting this error
any further does not change this or add any new information; on the
contrary, it adds confusion, since subsequent write()s and fsync()s
on other pages can succeed but will be reported as failures. The
application will need to deal with that first error irrespective of
subsequent return codes from fsync(). Conceptually, every fsync()
invocation demarcates an epoch for which it reports potential
errors, so the caller needs to take responsibility for that
particular epoch.
Callers that are not affected by the potential outcome of fsync()
and do not react to errors have no reason for calling it in the
first place (and thus masking the failure from subsequent callers
that may indeed care).

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Hi Stephen,

On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:
> fsync() doesn't reflect the status of given pages, however; it reflects
> the status of the file descriptor, and as such the file, on which it's
> called. This notion that fsync() is actually only responsible for the
> changes which were made to a file since the last fsync() call is pure
> foolishness. If we were able to pass a list of pages or data ranges to
> fsync() for it to verify they're on disk then perhaps things would be
> different, but we can't; all we can do is ask to "please flush all the
> dirty pages associated with this file descriptor, which represents this
> file we opened, to disk, and let us know if you were successful."
>
> Give us a way to ask "are these specific pages written out to persistent
> storage?" and we would certainly be happy to use it, and to repeatedly
> try to flush out pages which weren't synced to disk due to some
> transient error, and to track those cases and make sure that we don't
> incorrectly assume that they've been transferred to persistent storage.

Indeed, fsync() is simply a rather blunt instrument and a narrow
legacy interface, but further changing its established semantics (no
matter how unreasonable they may be) is probably not the way to go.

Would using sync_file_range() be helpful? Potential errors would
only apply to pages that cover the requested file ranges. There are
a few caveats though:

(a) it still messes with the top-level error reporting, so mixing it
with callers that use fsync() and do care about errors will produce
the same issue (clearing the error status);
(b) the error-reporting granularity is coarse (failure reporting
applies to the entire requested range, so you still don't know which
particular pages/file sub-ranges failed writeback);

(c) the same "report and forget" semantics apply to repeated
invocations of sync_file_range(), so again action will need to be
taken upon the first error encountered for the particular ranges.

> > The application will need to deal with that first error irrespective of
> > subsequent return codes from fsync(). Conceptually every fsync() invocation
> > demarcates an epoch for which it reports potential errors, so the caller
> > needs to take responsibility for that particular epoch.
>
> We do deal with that error- by realizing that it failed and later
> *retrying* the fsync(), which is when we get back an "all good!
> everything with this file descriptor you've opened is sync'd!" and
> happily expect that to be truth, when, in reality, it's an unfortunate
> lie and there are still pages associated with that file descriptor which
> are, in reality, dirty and not sync'd to disk.

It turns out that this is really not how the fsync() semantics work,
though, exactly because of the nature of the errors: even if the
kernel retained the dirty bits on the failed pages, retrying to
persist them at the same disk location would simply fail again.
Instead, the kernel opts for marking those pages clean (since there
is no other recovery strategy) and reporting once to the caller, who
can potentially deal with it in some manner. It is sadly a bad and
undocumented convention.

> Consider two independent programs where the first one writes to a file
> and then calls the second one whose job it is to go out and fsync(),
> perhaps async from the first, those files. Is the second program
> supposed to go write to each page that the first one wrote to, in order
> to ensure that all the dirty bits are set so that the fsync() will
> actually return if all the dirty pages are written?
I think what you have in mind are the semantics of sync() rather
than fsync(), but as long as an application needs to ensure that
data are persisted to storage, it needs to retain those data in its
heap until fsync() is successful, instead of discarding them and
relying on the kernel after write(). The pattern should be roughly
write() -> fsync() -> free(), rather than write() -> free() ->
fsync(). For example, if a partition becomes full upon fsync(), then
the application has a chance to persist the data in a different
location, while the kernel cannot possibly make this decision and
recover.

> > Callers that are not affected by the potential outcome of fsync() and
> > do not react to errors have no reason for calling it in the first place
> > (and thus masking the failure from subsequent callers that may indeed care).
>
> Reacting to an error from an fsync() call could, based on how it's
> documented and actually implemented in other OS's, mean "run another
> fsync() to see if the error has resolved itself." Requiring that to
> mean "you have to go dirty all of the pages you previously dirtied to
> actually get a subsequent fsync() to do anything" is really just not
> reasonable- a given program may have no idea what was written to
> previously nor any particular reason to need to know, on the expectation
> that the fsync() call will flush any dirty pages, as it's documented.
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Hi Robert,

On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:
> On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos wrote:
> > Given precisely that the dirty pages which cannot be written out are
> > practically thrown away, the semantics of fsync() (after the 4.13 fixes)
> > are essentially correct: the first call indicates that a writeback error
> > indeed occurred, while subsequent calls have no reason to indicate an
> > error (assuming no other errors occurred in the meantime).
>
> Like other people here, I think this is 100% unreasonable, starting
> with "the dirty pages which cannot be written out are practically
> thrown away". Who decided that was OK, and on the basis of what
> wording in what specification? I think it's always unreasonable to

If you insist on strict conformance to POSIX, then indeed the Linux
glibc configuration and associated manpage are probably wrong in
stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation
matches the flexibility allowed by not supporting SIO. There's a
long history of brokenness between Linux and POSIX, and I think
there was never an intention of fully conforming to the standard.

> throw away the user's data. If the writes are going to fail, then let
> them keep on failing every time. *That* wouldn't cause any data loss,
> because we'd never be able to checkpoint, and eventually the user
> would have to kill the server uncleanly, and that would trigger
> recovery.

I believe (as I tried to explain earlier) there is a certain
assumption being made: that the writer and original owner of the
data is responsible for dealing with potential errors in order to
avoid data loss (which should be of interest only to the original
writer anyway). It would be very questionable for the interface to
persist the error while subsequent writes and fsyncs to different
offsets may as well go through.
Another process may need to write into the file and fsync() it while
being unaware of those newly introduced semantics, and is now faced
with EIO because some unrelated previous process failed some earlier
writes and did not bother to clear the error. In a similar scenario
where the second process is aware of the new semantics, it would
naturally go ahead and clear the global error in order to proceed
with its own write()+fsync(), which would essentially amount to the
same problematic semantics you have now.

> Also, this really does make it impossible to write reliable programs.
> Imagine that, while the server is running, somebody runs a program
> which opens a file in the data directory, calls fsync() on it, and
> closes it. If the fsync() fails, postgres is now borked and has no
> way of being aware of the problem. If we knew, we could PANIC, but
> we'll never find out, because the unrelated process ate the error.
> This is exactly the sort of ill-considered behavior that makes fcntl()
> locking nearly useless.

Fully agree, and the errseq_t fixes have dealt exactly with the
issue of making sure that the error is reported to all file
descriptors that *happen to be open at the time of error*. But I
think one would have a hard time defending a modification to the
kernel where this is further extended to cover cases where:

process A does write() on some file offset which fails writeback,
fsync() gets EIO and exit()s;

process B does write() on some other offset which succeeds
writeback, but fsync() gets EIO due to the (uncleared) failures of
the earlier process.

This would be a highly user-visible change of semantics from
edge-triggered to level-triggered behavior.

> dodge this issue in another way: suppose that when we write a page
> out, we don't consider it really written until fsync() succeeds. Then

That's the only way to think about fsync() guarantees unless you are
on a kernel that keeps retrying to persist dirty pages.
Assuming such a model, after repeated and unrecoverable hard
failures the process would have to explicitly inform the kernel to
drop the dirty pages. All the process could do at that point is read
the dirty/failed pages back to userspace and attempt to rewrite them
at a different place (which is currently possible too). Most
applications would not bother, though, to inform the kernel and drop
the permanently failed pages; and thus someone would eventually hit
the case where a large number of failed writeback pages are running
his server out of memory, at which point people will complain that
those semantics are completely unreasonable.

> we wouldn't need to PANIC if an fsync() fails; we could just re-write
> the page. Unfortunately, this would also be terrible for performance,
> for pretty much the same reasons: letting the OS cache absorb lots of
> dirty blocks and do write-combining is *necessary* for good
> performance.

Not sure I understand this case.
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Tue, Apr 03, 2018 at 12:26:05PM +0100, Greg Stark wrote:
> On 3 April 2018 at 11:35, Anthony Iliopoulos wrote:
> > Hi Robert,
> >
> > Fully agree, and the errseq_t fixes have dealt exactly with the issue
> > of making sure that the error is reported to all file descriptors that
> > *happen to be open at the time of error*. But I think one would have a
> > hard time defending a modification to the kernel where this is further
> > extended to cover cases where:
> >
> > process A does write() on some file offset which fails writeback,
> > fsync() gets EIO and exit()s;
> >
> > process B does write() on some other offset which succeeds writeback,
> > but fsync() gets EIO due to the (uncleared) failures of the earlier
> > process.
>
> Surely that's exactly what process B would want? If it calls fsync and
> gets a success and later finds out that the file is corrupt and didn't
> match what was in memory it's not going to be happy.

You can't possibly make this assumption. Process B may be reading
and writing to regions completely disjoint from those of process A,
and as such may not really care about earlier failures, only wanting
to ensure its own writes go all the way through. But even if it did
care, the file interfaces make no transactional guarantees. Even
without fsync(), there is nothing preventing process B from reading
dirty pages from process A and, based on their content, proceeding
with its own business and writing/persisting new data, while process
A further modifies the not-yet-flushed pages in memory before
flushing. In this case you'd need explicit synchronization/locking
between the processes anyway, so why would fsync() be an exception?

> This seems like an attempt to co-opt fsync for a new and different
> purpose for which it's poorly designed. It's not an async error
> reporting mechanism for writes. It would be useless as that as any
> process could come along and open your file and eat the errors for
> writes you performed. An async error reporting mechanism would have to
> document which writes it was giving errors for and give you ways to
> control that.

The errseq_t fixes deal with that: errors will be reported to any
process that has an open fd, irrespective of who is the actual
caller of the fsync() that may have induced the errors. This is
required anyway, as the kernel may evict dirty pages on its own by
doing writeback, and as such there needs to be a way to report
errors on all open fds.

> The semantics described here are useless for everyone. For a program
> needing to know the error status of the writes it executed, it doesn't
> know which writes are included in which fsync call. For a program

If EIO persisted between invocations until explicitly cleared, a
process could not, without coordination, decide whether it should
clear the error and proceed or whether some other process still
needs to act on it, nor which writes actually failed for that
matter. We would be back to the case of requiring explicit
synchronization between processes that care about this, in which
case the processes may as well synchronize over calling fsync() in
the first place. Having an opt-in persistent EIO per fd would
practically be a form of "contract" between "cooperating" processes
anyway.

But instead of deconstructing and debating the semantics of the
current mechanism, why not come up with the ideal desired form of
error reporting/tracking, granularity, etc., and see how this may be
fitted into kernels as a new interface?

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Tue, Apr 03, 2018 at 03:37:30PM +0100, Greg Stark wrote:
> On 3 April 2018 at 14:36, Anthony Iliopoulos wrote:
> > If EIO persists between invocations until explicitly cleared, a process
> > cannot possibly make any decision as to whether it should clear the error
>
> I still don't understand what "clear the error" means here. The writes
> still haven't been written out. We don't care about tracking errors,
> we just care whether all the writes to the file have been flushed to
> disk. By "clear the error" you mean throw away the dirty pages and
> revert part of the file to some old data? Why would anyone ever want
> that?

It means that the responsibility of recovering the data is passed
back to the application. The writes may never be able to be written
out. How would a kernel deal with that? Either discard the data (and
have the writer acknowledge this) or buffer the data until reboot
and simply risk going OOM. It's not what someone would want, but
rather what one *needs* to deal with, one way or the other. At least
at the application level there's a fighting chance of restoring a
consistent state. The kernel does not have that opportunity.

> > But instead of deconstructing and debating the semantics of the
> > current mechanism, why not come up with the ideal desired form of
> > error reporting/tracking granularity etc., and see how this may be
> > fitted into kernels as a new interface.
>
> Because Postgres is portable software that won't be able to use some
> Linux-specific interface. And doesn't really need any granular error

I don't really follow this argument; Pg is admittedly using
non-portable interfaces already (e.g. sync_file_range()). While it's
nice to avoid platform-specific hacks, expecting that the POSIX
semantics will be consistent across systems is simply a 90's pipe
dream. While it would be lovely to have really consistent interfaces
for application writers, this is simply not going to happen any time
soon.
And since those problematic semantics of fsync() appear to be
prevalent in other systems as well, and are not likely to be
changed, you cannot rely on the preconception that once buffers are
handed over to the kernel you have a guarantee that they will
eventually be persisted no matter what. (Why even bother having
fsync() in that case? The kernel would eventually evict and write
back dirty pages anyway. The point of reporting the error back to
the application is to give it a chance to recover; the kernel could
repeat fsync() itself internally if that would solve anything.)

> reporting system anyways. It just needs to know when all writes have
> been synced to disk.

Well, it does know when *some* writes have *not* been synced to
disk, exactly because the responsibility is passed back to the
application. I do realize this puts more burden back on the
application, but what would a viable alternative be? Would you
rather have a kernel that risks periodically going OOM due to this
design decision?

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> On 8 April 2018 at 04:27, Craig Ringer wrote:
> > On 8 April 2018 at 10:16, Thomas Munro wrote:
> >
> > If the kernel does writeback in the middle, how on earth is it supposed to
> > know we expect to reopen the file and check back later?
> >
> > Should it just remember "this file had an error" forever, and tell every
> > caller? In that case how could we recover? We'd need some new API to say
> > "yeah, ok already, I'm redoing all my work since the last good fsync() so
> > you can clear the error flag now". Otherwise it'd keep reporting an error
> > after we did redo to recover, too.
>
> There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep
> track of any errors. We just need fsync to accurately report whether
> all the buffers in the file have been written out. When you call fsync

Instead, fsync() reports when some of the buffers have *not* been
written out, for the reasons outlined before. As such, it may make
some sense to maintain some error tracking even after marking the
failed dirty pages as clean (in fact this has been proposed, but it
introduces memory overhead).

> again the kernel needs to initiate i/o on all the dirty buffers and
> block until they complete successfully. If they complete successfully
> then nobody cares whether they had some failure in the past when i/o
> was initiated at some point in the past.

The question is, what should the kernel and application do in cases
where this is simply not possible? (According to FreeBSD, which
keeps dirty pages around after failure, for example, -EIO from the
block layer is a contract for unrecoverable errors, so it is
pointless to keep them dirty.) You'd need a specialized interface to
clear out the errors (and drop the dirty pages), or potentially just
remount the filesystem.

> The problem is not that errors aren't being tracked correctly. The
> problem is that dirty buffers are being marked clean when they haven't
> been written out. They consider dirty filesystem buffers when there's
> hardware failure preventing them from being written "a memory leak".
>
> As long as any error means the kernel has discarded writes then
> there's no real hope of any reliable operation through that interface.

This does not necessarily follow. Whether the kernel discards the
writes or not would not really help (see above). It is more a matter
of a proper "reporting contract" between userspace and the kernel,
and tracking would be a way of facilitating this, vs. having a more
complex userspace scheme (as described by others in this thread)
where synchronization for fsync() is required in a multi-process
application.

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:
> On 8 April 2018 at 22:47, Anthony Iliopoulos wrote:
> > On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:
> >
> > The question is, what should the kernel and application do in cases
> > where this is simply not possible? (According to FreeBSD, which keeps
> > dirty pages around after failure, for example, -EIO from the block
> > layer is a contract for unrecoverable errors, so it is pointless to
> > keep them dirty.) You'd need a specialized interface to clear out
> > the errors (and drop the dirty pages), or potentially just remount
> > the filesystem.
>
> Well firstly that's not necessarily the question. ENOSPC is not an
> unrecoverable error. And even unrecoverable errors for a single write
> doesn't mean the write will never be able to succeed in the future.

To make things a bit simpler, let us focus on EIO for the moment.
The contract between the block layer and the filesystem layer is
assumed to be that, when an EIO is propagated up to the fs, you may
assume that all possibilities for recovery have been exhausted in
the lower layers of the stack. Mind you, I am not claiming that this
contract is either documented or necessarily respected (in fact
there have been studies on the error propagation and handling of the
block layer; see [1]). Let us assume that this is the design
contract though (which appears to be the case across a number of
open-source kernels), and if not, it's a bug. In this case, the
specific write()s will indeed never be able to succeed in the
future, at least not as long as the BIOs are allocated to the
specific failing LBAs.

> But secondly doesn't such an interface already exist? When the device
> is dropped any dirty pages already get dropped with it. What's the
> point in dropping them but keeping the failing device?

I think there are degrees of failure.
There are certainly cases where one may encounter localized
unrecoverable medium errors (specific to certain LBAs) that are
non-maskable from the block layer and below. That does not mean that
the device is dropped at all, so it does make sense to continue all
other operations to all other regions of the device that are
functional. In cases of total device failure, the filesystem will
prevent you from proceeding anyway.

> But just to underline the point. "pointless to keep them dirty" is
> exactly backwards from the application's point of view. If the error
> writing to persistent media really is unrecoverable then it's all the
> more critical that the pages be kept so the data can be copied to some
> other device. The last thing user space expects to happen is if the
> data can't be written to persistent storage then also immediately
> delete it from RAM. (And the *really* last thing user space expects is
> for this to happen and return no error.)

Right. This implies, though, that apart from the kernel having to
keep the dirtied-but-unrecoverable pages around for an unbounded
time, there would further need to be an interface for obtaining the
exact failed pages so that you can read them back. This in turn
means that there needs to be an association between the fsync()
caller and the specific dirtied pages that the caller intends to
drain (for which we'd need an fsync_range(), among other things).

BTW, currently the failed writebacks are not dropped from memory,
but rather marked clean. They could be lost, though, due to memory
pressure or an explicit request (e.g. /proc drop_caches), unless
mlocked. There is a clear responsibility of the application to keep
its buffers around until a successful fsync(). The kernels do report
the error (albeit with all the complexities of dealing with the
interface), at which point the application may not assume that the
write()s were ever even buffered in the kernel page cache in the
first place.
What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages, or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

Best regards,
Anthony

[1] https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:

> On 9 April 2018 at 11:50, Anthony Iliopoulos wrote:
>
> > What you seem to be asking for is the capability of dropping
> > buffers over the (kernel) fence and indemnifying the application
> > from any further responsibility, i.e. a hard assurance
> > that either the kernel will persist the pages or it will
> > keep them around till the application recovers them
> > asynchronously, the filesystem is unmounted, or the system
> > is rebooted.
>
> That seems like a perfectly reasonable position to take, frankly.

Indeed, as long as you are willing to ignore the consequences of this design decision: mainly, how you would recover memory when no application is interested in clearing the error. At that point, other applications with different priorities will find this position rather unreasonable, since there can be no way out of it for them. Good luck convincing any OS kernel upstream to go with this design.

> The whole _point_ of an Operating System should be that you can do exactly
> that. As a developer I should be able to call write() and fsync() and know
> that if both calls have succeeded then the result is on disk, no matter
> what another application has done in the meantime. If that's a "difficult"
> problem then that's the OS's problem, not mine. If the OS doesn't do that,
> it's _not_doing_its_job_.

No OS kernel that I know of provides any promise of atomicity for a write()+fsync() sequence, unless one is using O_SYNC. It doesn't provide you with isolation either, as this is delegated to userspace, where processes that share a file should coordinate accordingly. It's not that this is a difficult problem; rather, the kernels provide a common denominator of possible interfaces and designs that can accommodate a wider range of potential application scenarios, for which the kernel cannot possibly anticipate requirements.
There has been plenty of experimental work on providing a transactional (ACID) filesystem interface to applications. At the opposite end, there have been quite a few commercial databases that completely bypass the kernel storage stack. But I would assume it is reasonable to figure out something between those two extremes that can work in a "portable" fashion.

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 09, 2018 at 08:16:38PM +0800, Craig Ringer wrote:

> I'd like a middle ground where the kernel lets us register our interest and
> tells us if it lost something, without us having to keep eight million FDs
> open for some long period. "Tell us about anything that happens under
> pgdata/" or an inotify-style per-directory-registration option. I'd even
> say that's ideal.

I see what you are saying. So basically you'd always keep the notification descriptor open, and the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offsets they map to). The kernel wouldn't even have to maintain per-page bits to trace the errors, since they would be consumed by the process that reads the events (or discarded, when the notification fd is closed).

Assuming this would be possible, wouldn't Pg still need to deal with synchronizing writers and related issues (since this would be merely a notification mechanism and would not prevent any process from continuing)? I understand that would be rather intrusive for the current Pg multi-process design. But other than that, this interface could in principle be implemented in the BSDs via kqueue(), I suppose, to provide what you need.

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:

> We already have dirty_bytes and dirty_background_bytes, for example. I
> don't see why there couldn't be another limit defining how much dirty
> data to allow before blocking writes altogether. I'm sure it's not that
> simple, but you get the general idea - do not allow using all available
> memory because of writeback issues, but don't throw the data away in
> case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie" pages may occupy. I am not sure how helpful that would be in the long run, since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning it accordingly in a system-wide manner. This has the potential to leave other applications running on the same system with very little memory, in cases where, for example, the original application crashes and never clears the error. Apart from that, further interfaces would need to be provided for actually dealing with the error (again assuming non-transient issues that may not be fixed transparently, and that temporary issues are taken care of by lower layers of the stack).

> Well, there seem to be kernels that seem to do exactly that already. At
> least that's how I understand what this thread says about FreeBSD and
> Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and that this needs to be done differently. I have no idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of the relevant pages. It does indeed provide a persistent error indication, though, which would allow Pg to simply and reliably panic.
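For reference, the existing writeback knobs Tomas mentions can be inspected from userspace via /proc on Linux; a small sketch (the helper name is made up, and the exact set of files present varies by kernel):

```python
def read_writeback_knobs():
    """Return the current Linux VM writeback thresholds, if available.

    Reads the sysctl files under /proc/sys/vm; on non-Linux systems
    (or kernels lacking some of the knobs) the missing entries are
    simply omitted.
    """
    knobs = {}
    for name in ("dirty_bytes", "dirty_background_bytes",
                 "dirty_ratio", "dirty_background_ratio"):
        try:
            with open("/proc/sys/vm/" + name) as f:
                knobs[name] = int(f.read())
        except OSError:
            pass  # /proc/sys not available on this platform
    return knobs
```

A hypothetical "max zombie pages" limit of the kind discussed above would presumably surface as another such sysctl, with all the system-wide tuning caveats that implies.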
But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes are successfully persisted. Ideally, you'd want a (potentially persistent) indication of error localized to a file region (mapping the corresponding failed writeback pages). NetBSD already implements fsync_range(), which could be a step in the right direction.

> One has to wonder how many applications actually use this correctly,
> considering PostgreSQL cares about data durability/consistency so much
> and yet we've been misunderstanding how it works for 20+ years.

I would expect very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO). I think that durability is a rather complex cross-layer issue which has similarly been grossly misunderstood in the past (e.g. see [1]). It seems that both the OS and DB communities would greatly benefit from a periodic reality check, and I see this as an opportunity for strengthening the IO stack in an end-to-end manner.

Best regards,
Anthony

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 09, 2018 at 04:29:36PM +0100, Greg Stark wrote:

> Honestly I don't think there's *any* way to use the current interface
> to implement reliable operation. Even that embedded database using a
> single process and keeping every file open all the time (which means
> file descriptor limits limit its scalability) can be having silent
> corruption whenever some other process like a backup program comes
> along and calls fsync (or even sync?).

That is indeed true (sync would induce fsync on open inodes and clear the error), and that's a nasty bug that apparently went unnoticed for a very long time. Hopefully the errseq_t fixes in Linux 4.13 deal with at least this issue, but similar fixes need to be adopted by many other kernels (all those that mark failed pages as clean).

I honestly do not expect that keeping the failed pages around will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync(). What about having buffered IO with implied fsync() atomicity via O_SYNC? This would probably necessitate some helper threads that mask the latency and present an async interface to the rest of PG, but it sounds less intrusive than going for DIO.

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 09, 2018 at 12:29:16PM -0700, Andres Freund wrote:

> On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:
> > What about having buffered IO with implied fsync() atomicity via
> > O_SYNC?
>
> You're kidding, right? We could also just add sleep(30)'s all over the
> tree, and hope that that'll solve the problem. There's a reason we
> don't permanently fsync everything. Namely that it'll be way too slow.

I am assuming you could apply the same principle of selectively using O_SYNC at the times and places where you'd currently actually call fsync(). I am also assuming that you'd want a backwards-compatible solution for all those kernels that don't keep the pages around, irrespective of future fixes. Short of loading a kernel module and dealing with the problem directly, the only other available options seem to be O_SYNC, O_DIRECT, or ignoring the issue.

Best regards,
Anthony
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On Mon, Apr 09, 2018 at 12:37:03PM -0700, Andres Freund wrote:

> On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos wrote:
>
> >I honestly do not expect that keeping around the failed pages will
> >be an acceptable change for most kernels, and as such the
> >recommendation
> >will probably be to coordinate in userspace for the fsync().
>
> Why is that required? You could very well just keep per inode information
> about fatal failures that occurred around. Report errors until that bit is
> explicitly cleared. Yes, that keeps some memory around until unmount if
> nobody clears it. But it's orders of magnitude less, and results in usable
> semantics.

As discussed before, I think this could be acceptable, especially if paired with an opt-in mechanism (only applications that care to deal with this would have to), and I would give it a shot. We would still need a way to deal with all the other systems and prior kernel releases that are eating fsync() writeback errors even over sync(), though.

Best regards,
Anthony
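The semantics Andres proposes can be modeled in a few lines (a toy illustration, not kernel code): the error sticks to the inode and is reported by every fsync() until explicitly cleared, instead of being consumed by whichever caller happens to see it first.

```python
import errno
import os

class Inode:
    """Toy model of sticky per-inode writeback errors."""

    def __init__(self):
        self.error = None  # sticky error, survives until cleared

    def writeback_failed(self):
        # Called when writeback hits an unrecoverable error.
        self.error = errno.EIO

    def fsync(self):
        # Unlike the historical mark-pages-clean behavior, the error
        # is reported to every caller, not just the first one.
        if self.error is not None:
            raise OSError(self.error, os.strerror(self.error))

    def clear_error(self):
        # The explicit, opt-in step an interested application takes
        # once it has recovered (or abandoned) the affected data.
        self.error = None
```

The memory cost is a flag per failed inode rather than the failed pages themselves, which is what makes it plausible as a kernel change.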
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Hi Robert,

On Tue, Apr 10, 2018 at 11:15:46AM -0400, Robert Haas wrote:

> On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund wrote:
> > Let's lower the pitchforks a bit here. Obviously a grand rewrite is
> > absurd, as is some of the proposed ways this is all supposed to
> > work. But I think the case we're discussing is much closer to a near
> > irresolvable corner case than anything else.
>
> Well, I admit that I wasn't entirely serious about that email, but I
> wasn't entirely not-serious either. If you can't reliably find
> out whether the contents of the file on disk are the same as the
> contents that the kernel is giving you when you call read(), then you
> are going to have a heck of a time building a reliable system. If the
> kernel developers are determined to insist on these semantics (and,
> admittedly, I don't know whether that's the case - I've only read
> Anthony's remarks), then I don't really see what we can do except give
> up on buffered I/O (or on Linux).

I think it would be interesting to get in touch with some of the respective Linux kernel maintainers and open this topic up for more detailed discussions. LSF/MM'18 is upcoming and would have been the perfect opportunity, but it's past the CFP deadline. It may still be worth contacting the organizers to bring the issue forward, and to see if there is a chance to have someone from Pg invited for further discussions.

Best regards,
Anthony