On Thu, Mar 7, 2019 at 6:26 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> On Fri, Mar 8, 2019 at 12:35 PM Jerry Jelinek <jerry.jeli...@joyent.com> > wrote: > > On Thu, Mar 7, 2019 at 3:09 PM Thomas Munro <thomas.mu...@gmail.com> > wrote: > >> My understanding is that it's not really the COW-ness that makes it > >> not necessary, it's the fact that fdatasync() doesn't do anything > >> different from fsync() on ZFS and there is no situation where > >> fdatasync() succeeds, you lose power, you come back up and find that > >> the file size is wrong or a hole in the middle of the file has come > >> back from the dead, and you lost the data. The whole concept of "data > >> sync" implies that file meta-data and file contents are cached and > >> synchronized separately and you can deliberately ask for weaker > >> coherency to cut down on IOs; *that's* the thing that ZFS doesn't > >> have, and couldn't benefit from because it's just going to write stuff > >> in its tidy sequential log in the form of all-or-nothing transactions > >> anyway. I don't know if that's also true for eg BTRFS or any other > >> COW filesystem that might be out there, but I don't know why you'd > >> want to mention COW instead of wal_sync_mode as the motivation when > >> the source code comment know better. > > > > > > Hopefully I am not misinterpreting your comment here, but I'm not sure I > fully agree with that assessment. I can't speak for other filesystems, but > for ZFS, none of the zero-filled blocks will actually be written to disk, > but that determination happens fairly late in the process, after the > thousands of write system calls have been processed. So on ZFS, these > writes are basically useless, except as a way to increase the size of the > file. No disk space is actually allocated. However, on any COW filesystem, > any write to any of these zero-filled blocks will have to allocate a new > block, so nothing about "preallocating space" has been accomplished by all > of these system calls. At least, preallocating space is my understanding of > why the zero-fill is currently being performed. > > It seems like you're focusing on the performance and I'm focusing on > the safety. Obviously it's a complete waste of time to try to > "preallocate" space on COW filesystems since they will not reuse that > space anyway by definition. My point was that it may be unsafe to > turn if off when configured to use fdatasync() for later writes to the > file on filesystems that make fewer durability guarantees with > fdatasync() than with a full fsync(), and that seems like another > newsworthy angle on this for end users to know about. I dunno, maybe > those things are so closely linked that it's OK to write just "only > turn it off on COW filesystems", but I'm wondering why we don't > mention the actual reason for the feature when we make that claim in > the comments. > > Hmm... digging a bit further. So, those comments in xlog.c date from > 2013/2014 when this stuff was going down: > > https://lkml.org/lkml/2012/9/3/83 > > https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai > http://www.openldap.org/lists/openldap-devel/201411/msg00002.html > > So that was not actually the intended behaviour of fdatasync(), but > rather a bug in ext3/4 that's been fixed now. POSIX says "File > attributes that are not necessary for data retrieval (access time, > modification time, status change time) need not be successfully > transferred prior to returning to the calling process.", and the Linux > man page says that it "does not flush modified metadata unless that > metadata is needed in order to allow a subsequent data retrieval to be > correctly handled", so... if there are no OS bugs, the comments in > xlog.c are overly pessimistic and the only effect of using fdatasync() > instead of fsync() to to avoid extra IOs for mtime etc. > > I still like the pessimism in the code. But OK, I withdraw my > complaint about that sentence in the documentation for now! :-) I haven't heard any additional feedback in the last couple of weeks, so I wanted to see if there is anything else needed for this patch? I did update the code to use pg_pwrite. A new version of the patch is attached. The only difference from the previous version is this diff: --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -3283,8 +3283,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock) */ errno = 0; pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); - fail = lseek(fd, wal_segment_size - 1, SEEK_SET) < (off_t) 0 || - (int) write(fd, zbuffer.data, 1) != (int) 1; + fail = pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != + (ssize_t) 1; pgstat_report_wait_end(); } The latest patch is rebased, builds clean, and passes some basic testing. Please let me know if there is anything else I could do on this. Thanks, Jerry
0001-wal_recycle-and-wal_init_zero.patch
Description: Binary data