On 2019-Feb-05, Jerry Jelinek wrote:

> First, since last fall, we have found another performance problem related
> to initializing WAL files. I've described this issue in more detail below,
> but in order to handle this new problem, I decided to generalize the patch
> so the tunable refers to running on a Copy-On-Write filesystem instead of
> just being specific to WAL recycling. Specifically, I renamed the GUC
> tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
> more obvious what is being tuned and will also be more flexible if there
> are other problems in the future which are related to running on a COW
> filesystem. I'm happy to choose a different name for the tunable if people
> don't like 'wal_cow_fs'.

I think the idea of it being a generic tunable for assorted behavior
changes, rather than specific to WAL recycling, is a good one.  I'm
unsure about your proposed name -- maybe "wal_cow_filesystem" is better?

I'm rewording your doc addition a little bit.  Here's my proposal:

       <para>
        This parameter should only be set to <literal>on</literal> when the WAL
        resides on a <firstterm>Copy-On-Write</firstterm> 
(<acronym>COW</acronym>)
        filesystem.
        Enabling this option adjusts behavior to take advantage of the
        filesystem characteristics (for example, recycling WAL files and
        zero-filling new WAL files are disabled).

This part sounds good enough to me -- further suggestions welcome.

I'm less sure about this phrase:

        This setting is only appropriate for filesystems which
        allocate new disk blocks on every write.

Is "... which allocate new disk blocks on every write" a technique
distinct from CoW itself?  I'm confused as to what it means, or how can
the user tell whether they are on such a filesystem.

Obviously you're thinking that ZFS is such a filesystem and everybody
who has pg_wal on ZFS should enable this option.  What about, say, Btrfs
-- should they turn this option on?  Browsing the wikipedia, I find that
Windows has this ReFS thing that apparently is also CoW, but NTFS isn't.
I don't think either Btrfs or ReFS are realistic options to put pg_wal
on, so let's just list the common filesystems for which users are
supposed to enable this option ... which I think nowadays is just ZFS.
All in all, I would replace this phrase with something like: "This
setting should be enabled when pg_wal resides on a ZFS filesystem or
similar." That should be weasely enough that it's clear that we expect
users to do the homework when on unusual systems, while actively pointing
out the most common use case.

> Finally, the patch now includes bypassing the zero-fill for new WAL files
> when wal_cow_fs is true.

That makes sense.  I think all these benchmarks Tomas Vondra run are not
valid anymore ...

The attached v2 has assorted cosmetic cleanups.  If you can validate it,
I would appreciate it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From a4ce02f5cc8ad983c34712083f9cba7fda6d5b38 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date: Wed, 27 Feb 2019 19:41:05 -0300
Subject: [PATCH v2] pg_wal on COW fs

---
 doc/src/sgml/config.sgml                      |  20 ++++
 src/backend/access/transam/xlog.c             | 101 ++++++++++++------
 src/backend/utils/misc/guc.c                  |  13 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/xlog.h                     |   1 +
 5 files changed, 102 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f376b2..60a873273aa 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2959,6 +2959,26 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-wal-cow-filesystem" xreflabel="wal_cow_filesystem">
+      <term><varname>wal_cow_filesystem</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>wal_cow_filesystem</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        This parameter should only be set to <literal>on</literal> when the WAL
+        resides on a <firstterm>Copy-On-Write</firstterm> (<acronym>COW</acronym>)
+        filesystem.
+        Enabling this option adjusts some behavior to take advantage of the
+        filesystem characteristics (for example, recycling WAL files and
+        zero-filling new WAL files are disabled).
+        This setting is only appropriate for filesystems which
+        allocate new disk blocks on every write.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
      </sect2>
      <sect2 id="runtime-config-wal-archiving">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53ae..1acce1c70d6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -94,6 +94,7 @@ bool		wal_log_hints = false;
 bool		wal_compression = false;
 char	   *wal_consistency_checking_string = NULL;
 bool	   *wal_consistency_checking = NULL;
+bool		wal_cow_filesystem = false;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -3216,6 +3217,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	XLogSegNo	max_segno;
 	int			fd;
 	int			nbytes;
+	bool		fail;
 
 	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
 
@@ -3255,41 +3257,65 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
-	/*
-	 * Zero-fill the file.  We have to do this the hard way to ensure that all
-	 * the file space has really been allocated --- on platforms that allow
-	 * "holes" in files, just seeking to the end doesn't allocate intermediate
-	 * space.  This way, we know that we have all the space and (after the
-	 * fsync below) that all the indirect blocks are down on disk.  Therefore,
-	 * fdatasync(2) or O_DSYNC will be sufficient to sync future writes to the
-	 * log file.
-	 */
 	memset(zbuffer.data, 0, XLOG_BLCKSZ);
-	for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
+
+	/*
+	 * Ensure the filesystem has physically allocated disk space for the data.
+	 */
+	if (!wal_cow_filesystem)
 	{
+		/*
+		 * In non-CoW filesystems, zero-fill the file.  We have to do this the
+		 * hard way to ensure that all the file space has really been
+		 * allocated --- on platforms that allow "holes" in files, just seeking
+		 * to the end doesn't allocate intermediate space.  This way, we know
+		 * that we have all the space and (after the fsync below) that all the
+		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
+		 * O_DSYNC will be sufficient to sync future writes to the log file.
+		 */
+		fail = false;	/* keep compiler quiet */
+		for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
+		{
+			errno = 0;
+			pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
+			fail = (int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ;
+			pgstat_report_wait_end();
+			if (fail)
+				break;
+		}
+	}
+	else
+	{
+		/*
+		 * In CoW filesystems, seeking to the end and writing a solitary byte
+		 * is enough.
+		 */
 		errno = 0;
 		pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-		if ((int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ)
-		{
-			int			save_errno = errno;
 
-			/*
-			 * If we fail to make the file, delete it to release disk space
-			 */
-			unlink(tmppath);
+		fail = lseek(fd, (off_t) (wal_segment_size - 1), SEEK_SET) < (off_t) 0 ||
+			(int) write(fd, zbuffer.data, 1) != (int) 1;
 
-			close(fd);
-
-			/* if write didn't set errno, assume problem is no disk space */
-			errno = save_errno ? save_errno : ENOSPC;
-
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write to file \"%s\": %m", tmppath)));
-		}
 		pgstat_report_wait_end();
 	}
 
+	if (fail)
+	{
+		int			save_errno = errno;
+
+		/* If we fail to make the file, delete it to release disk space */
+		unlink(tmppath);
+
+		close(fd);
+
+		/* if write didn't set errno, assume problem is no disk space */
+		errno = save_errno ? save_errno : ENOSPC;
+
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", tmppath)));
+	}
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -4053,14 +4079,17 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 	XLogSegNo	endlogSegNo;
 	XLogSegNo	recycleSegNo;
 
-	/*
-	 * Initialize info about where to try to recycle to.
-	 */
-	XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
-	if (RedoRecPtr == InvalidXLogRecPtr)
-		recycleSegNo = endlogSegNo + 10;
+	/* Initialize info about where to try to recycle to, if needed. */
+	if (!wal_cow_filesystem)
+	{
+		XLByteToSeg(endptr, endlogSegNo, wal_segment_size);
+		if (RedoRecPtr == InvalidXLogRecPtr)
+			recycleSegNo = endlogSegNo + 10;
+		else
+			recycleSegNo = XLOGfileslop(RedoRecPtr);
+	}
 	else
-		recycleSegNo = XLOGfileslop(RedoRecPtr);
+		recycleSegNo = (XLogSegNo) 0;	/* keep compiler quiet */
 
 	snprintf(path, MAXPGPATH, XLOGDIR "/%s", segname);
 
@@ -4068,8 +4097,12 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 	 * Before deleting the file, see if it can be recycled as a future log
 	 * segment. Only recycle normal files, pg_standby for example can create
 	 * symbolic links pointing to a separate archive directory.
+	 *
+	 * Skip recycling on COW filesystems, though.  It's better to create
+	 * new files each time.
 	 */
-	if (endlogSegNo <= recycleSegNo &&
+	if (!wal_cow_filesystem &&
+		endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
 							   true, recycleSegNo, true))
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 156d147c85b..cbcf475885e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1177,6 +1177,19 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"wal_cow_filesystem", PGC_SUSET, WAL_SETTINGS,
+			gettext_noop("WAL is stored on Copy-On-Write file system."),
+			gettext_noop("This option adjusts behavior to take advantage of "
+						 "filesystem characteristics specific to CoW filesystems, "
+						 "improving performance. "
+						 "It should be enabled on ZFS and other similar filesystems.")
+		},
+		&wal_cow_filesystem,
+		false,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"log_checkpoints", PGC_SIGHUP, LOGGING_WHAT,
 			gettext_noop("Logs each checkpoint."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 194f3120964..c9254c04bab 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -206,6 +206,7 @@
 #wal_compression = off			# enable compression of full-page writes
 #wal_log_hints = off			# also do full page writes of non-critical updates
 					# (change requires restart)
+#wal_cow_filesystem = off	# is pg_wal on a Copy-on-Write filesystem?
 #wal_buffers = -1			# min 32kB, -1 sets based on shared_buffers
 					# (change requires restart)
 #wal_writer_delay = 200ms		# 1-10000 milliseconds
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f90a6a91391..fa9d69affe4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -116,6 +116,7 @@ extern bool EnableHotStandby;
 extern bool fullPageWrites;
 extern bool wal_log_hints;
 extern bool wal_compression;
+extern bool wal_cow_filesystem;
 extern bool *wal_consistency_checking;
 extern char *wal_consistency_checking_string;
 extern bool log_checkpoints;
-- 
2.17.1

Reply via email to