On Thu, Mar 7, 2019 at 6:26 PM Thomas Munro <thomas.mu...@gmail.com> wrote:

> On Fri, Mar 8, 2019 at 12:35 PM Jerry Jelinek <jerry.jeli...@joyent.com>
> wrote:
> > On Thu, Mar 7, 2019 at 3:09 PM Thomas Munro <thomas.mu...@gmail.com>
> wrote:
> >> My understanding is that it's not really the COW-ness that makes it
> >> not necessary, it's the fact that fdatasync() doesn't do anything
> >> different from fsync() on ZFS and there is no situation where
> >> fdatasync() succeeds, you lose power, you come back up and find that
> >> the file size is wrong or a hole in the middle of the file has come
> >> back from the dead, and you lost the data.  The whole concept of "data
> >> sync" implies that file meta-data and file contents are cached and
> >> synchronized separately and you can deliberately ask for weaker
> >> coherency to cut down on IOs; *that's* the thing that ZFS doesn't
> >> have, and couldn't benefit from because it's just going to write stuff
> >> in its tidy sequential log in the form of all-or-nothing transactions
> >> anyway.  I don't know if that's also true for eg BTRFS or any other
> >> COW filesystem that might be out there, but I don't know why you'd
> >> want to mention COW instead of wal_sync_mode as the motivation when
> >> the source code comment know better.
> >
> >
> > Hopefully I am not misinterpreting your comment here, but I'm not sure I
> fully agree with that assessment. I can't speak for other filesystems, but
> for ZFS, none of the zero-filled blocks will actually be written to disk,
> but that determination happens fairly late in the process, after the
> thousands of write system calls have been processed. So on ZFS, these
> writes are basically useless, except as a way to increase the size of the
> file. No disk space is actually allocated. However, on any COW filesystem,
> any write to any of these zero-filled blocks will have to allocate a new
> block, so nothing about "preallocating space" has been accomplished by all
> of these system calls. At least, preallocating space is my understanding of
> why the zero-fill is currently being performed.
>
> It seems like you're focusing on the performance and I'm focusing on
> the safety.  Obviously it's a complete waste of time to try to
> "preallocate" space on COW filesystems since they will not reuse that
> space anyway by definition.  My point was that it may be unsafe to
> turn if off when configured to use fdatasync() for later writes to the
> file on filesystems that make fewer durability guarantees with
> fdatasync() than with a full fsync(), and that seems like another
> newsworthy angle on this for end users to know about.  I dunno, maybe
> those things are so closely linked that it's OK to write just "only
> turn it off on COW filesystems", but I'm wondering why we don't
> mention the actual reason for the feature when we make that claim in
> the comments.
>
> Hmm... digging a bit further.  So, those comments in xlog.c date from
> 2013/2014 when this stuff was going down:
>
> https://lkml.org/lkml/2012/9/3/83
>
> https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai
> http://www.openldap.org/lists/openldap-devel/201411/msg00002.html
>
> So that was not actually the intended behaviour of fdatasync(), but
> rather a bug in ext3/4 that's been fixed now.  POSIX says "File
> attributes that are not necessary for data retrieval (access time,
> modification time, status change time) need not be successfully
> transferred prior to returning to the calling process.", and the Linux
> man page says that it "does not flush modified metadata unless that
> metadata is needed in order to allow a subsequent data retrieval to be
> correctly handled", so... if there are no OS bugs, the comments in
> xlog.c are overly pessimistic and the only effect of using fdatasync()
> instead of fsync() to to avoid extra IOs for mtime etc.
>
> I still like the pessimism in the code.  But OK, I withdraw my
> complaint about that sentence in the documentation for now!  :-)


 I haven't heard any additional feedback in the last couple of weeks, so I
wanted to see if there is anything else needed for this patch? I did update
the code to use pg_pwrite. A new version of the patch is attached. The only
difference from the previous version is this diff:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3283,8 +3283,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent,
bool use_lock)
                 */
                errno = 0;
                pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-               fail = lseek(fd, wal_segment_size - 1, SEEK_SET) < (off_t)
0 ||
-                       (int) write(fd, zbuffer.data, 1) != (int) 1;
+               fail = pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1)
!=
+                       (ssize_t) 1;
                pgstat_report_wait_end();
        }

The latest patch is rebased, builds clean, and passes some basic testing.
Please let me know if there is anything else I could do on this.

Thanks,
Jerry

Attachment: 0001-wal_recycle-and-wal_init_zero.patch
Description: Binary data

Reply via email to