On Fri, Nov 27, 2015 at 8:17 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> So, what's going on? The problem is that while the rename() is atomic,
> it's not guaranteed to be durable without an explicit fsync on the parent
> directory. And by default we only do fdatasync on the recycled segments,
> which may not force fsync on the directory (and ext4 does not do that,
> apparently).
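(For readers skimming the thread: the hazard described above boils down to
the pattern below. This is a plain POSIX C sketch, not the patch from this
thread, and durable_rename is just an illustrative name.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * rename() atomically replaces the directory entry, but the entry itself
 * lives in the parent directory -- on ext4 the rename can be lost on a
 * power failure unless that directory is fsynced as well.
 */
static void
durable_rename(const char *oldpath, const char *newpath, const char *dir)
{
	int			fd;

	if (rename(oldpath, newpath) < 0)
	{
		perror("rename");
		exit(1);
	}

	/* fsync the parent directory so the new entry survives a crash */
	fd = open(dir, O_RDONLY);
	if (fd < 0 || fsync(fd) < 0)
	{
		perror("fsync parent directory");
		exit(1);
	}
	close(fd);
}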
Yeah, that seems to be the way the POSIX spec settles this:
"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force
all currently queued I/O operations associated with the file indicated by
file descriptor fildes to the synchronized I/O completion state. All I/O
operations shall be completed as defined for synchronized I/O file
integrity completion."
http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
If I understand that right, the rename() is guaranteed to be atomic,
meaning that there will be only one file even if there is a crash, but we
still need to fsync() the parent directory as mentioned.

> FWIW this has nothing to do with storage reliability - you may have good
> drives, RAID controller with BBU, reliable SSDs or whatever, and you're
> still not safe. This issue is at the filesystem level, not storage.

The POSIX spec authorizes this behavior, so the filesystem is clearly not
to blame. At least that's what I get from it.

> I've done the same tests on xfs and that seems to be safe - I've been
> unable to reproduce the issue, so either the issue is not there or it's
> more difficult to hit it. I haven't tried on other file systems, because
> ext4 and xfs cover the vast majority of deployments (at least on Linux),
> and thus an issue on ext4 is serious enough I believe.
>
> So pretty much everyone running on Linux + ext3/ext4 is vulnerable.
>
> It's also worth mentioning that the data is not actually lost - it's
> properly fsynced in the WAL segments, it's just the rename that got lost.
> So it's possible to survive this without losing data by manually renaming
> the segments, but this must happen before starting the cluster because
> the automatic recovery comes and discards all the data etc.

Hm. Most users are not going to notice that, particularly on embedded
setups.

> I think this issue might also result in various other issues, not just
> data loss. For example, I wouldn't be surprised by data corruption due to
> flushing some of the changes in data files to disk (due to contention for
> shared buffers and reaching vm.dirty_bytes) and then losing the matching
> WAL segment. Also, while I have only seen 1 to 3 segments getting lost,
> it might be possible that more segments can get lost, possibly making the
> recovery impossible. And of course, this might cause problems with WAL
> archiving due to archiving the same segment twice (before and after
> crash).

Possible; the switch to .done happens after the segment is renamed in
xlogarchive.c, so this could happen in theory.

> Attached is a proposed fix for this (xlog-fsync.patch), and I'm pretty
> sure this needs to be backpatched to all backbranches. I've also attached
> a patch that adds a pg_current_xlog_flush_location() function, which
> proved to be quite useful when debugging this issue.

Agreed. We should also make sure that the calls to fsync_fname are issued
inside a critical section with START/END_CRIT_SECTION(). That does not
seem to be the case with your patch.
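What I have in mind is something like the following (an untested sketch,
not your patch: tmppath and path stand in for whatever segment is being
installed, fsync_fname() is the existing helper in fd.c, and XLOGDIR is
the usual pg_xlog):

	START_CRIT_SECTION();

	if (rename(tmppath, path) < 0)
		ereport(PANIC,
				(errcode_for_file_access(),
				 errmsg("could not rename file \"%s\" to \"%s\": %m",
						tmppath, path)));

	/* flush the renamed file, then the directory holding its entry */
	fsync_fname(path, false);
	fsync_fname(XLOGDIR, true);

	END_CRIT_SECTION();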
> And finally, I've done a quick review of all places that might suffer
> the same issue - some are not really interesting as the stuff is
> ephemeral anyway (like pgstat for example), but there are ~15 places
> that may need this fix:
>
> * src/backend/access/transam/timeline.c (2 matches)
> * src/backend/access/transam/xlog.c (9 matches)
> * src/backend/access/transam/xlogarchive.c (3 matches)
> * src/backend/postmaster/pgarch.c (1 match)
>
> Some of these places might actually be safe because a fsync happens
> somewhere immediately after the rename (e.g. in a caller), but I guess
> better safe than sorry.

I had a quick look at those code paths, and indeed it would be safer to
make sure that the fsync calls are issued once rename() is done.

> I plan to do more power failure testing soon, with more complex test
> scenarios. I suspect there might be other similar issues (e.g. when we
> rename a file before a checkpoint and don't fsync the directory - then
> the rename won't be replayed and will be lost).

That would be great.
--
Michael