On 2019-Feb-05, Jerry Jelinek wrote: > First, since last fall, we have found another performance problem related > to initializing WAL files. I've described this issue in more detail below, > but in order to handle this new problem, I decided to generalize the patch > so the tunable refers to running on a Copy-On-Write filesystem instead of > just being specific to WAL recycling. Specifically, I renamed the GUC > tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it > more obvious what is being tuned and will also be more flexible if there > are other problems in the future which are related to running on a COW > filesystem. I'm happy to choose a different name for the tunable if people > don't like 'wal_cow_fs'.
I think the idea of it being a generic tunable for assorted behavior changes, rather than specific to WAL recycling, is a good one. I'm unsure about your proposed name -- maybe "wal_cow_filesystem" is better? I'm rewording your doc addition a little bit. Here's my proposal: <para> This parameter should only be set to <literal>on</literal> when the WAL resides on a <firstterm>Copy-On-Write</firstterm> (<acronym>COW</acronym>) filesystem. Enabling this option adjusts behavior to take advantage of the filesystem characteristics (for example, recycling WAL files and zero-filling new WAL files are disabled). This part sounds good enough to me -- further suggestions welcome. I'm less sure about this phrase: This setting is only appropriate for filesystems which allocate new disk blocks on every write. Is "... which allocate new disk blocks on every write" a technique distinct from CoW itself? I'm confused as to what it means, or how can the user tell whether they are on such a filesystem. Obviously you're thinking that ZFS is such a filesystem and everybody who has pg_wal on ZFS should enable this option. What about, say, Btrfs -- should they turn this option on? Browsing the wikipedia, I find that Windows has this ReFS thing that apparently is also CoW, but NTFS isn't. I don't think either Btrfs or ReFS are realistic options to put pg_wal on, so let's just list the common filesystems for which users are supposed to enable this option ... which I think nowadays is just ZFS. All in all, I would replace this phrase with something like: "This setting should be enabled when pg_wal resides on a ZFS filesystem or similar." That should be weasely enough that it's clear that we expect users to do the homework when on unusual systems, while actively pointing out the most common use case. > Finally, the patch now includes bypassing the zero-fill for new WAL files > when wal_cow_fs is true. That makes sense. I think all these benchmarks Tomas Vondra run are not valid anymore ... The attached v2 has assorted cosmetic cleanups. If you can validate it, I would appreciate it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From a4ce02f5cc8ad983c34712083f9cba7fda6d5b38 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera <alvhe...@alvh.no-ip.org> Date: Wed, 27 Feb 2019 19:41:05 -0300 Subject: [PATCH v2] pg_wal on COW fs --- doc/src/sgml/config.sgml | 20 ++++ src/backend/access/transam/xlog.c | 101 ++++++++++++------ src/backend/utils/misc/guc.c | 13 +++ src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + 5 files changed, 102 insertions(+), 34 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 8bd57f376b2..60a873273aa 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2959,6 +2959,26 @@ include_dir 'conf.d' </listitem> </varlistentry> + <varlistentry id="guc-wal-cow-filesystem" xreflabel="wal_cow_filesystem"> + <term><varname>wal_cow_filesystem</varname> (<type>boolean</type>) + <indexterm> + <primary><varname>wal_cow_filesystem</varname> configuration parameter</primary> + </indexterm> + </term> + <listitem> + <para> + This parameter should only be set to <literal>on</literal> when the WAL + resides on a <firstterm>Copy-On-Write</firstterm> (<acronym>COW</acronym>) + filesystem. + Enabling this option adjusts some behavior to take advantage of the + filesystem characteristics (for example, recycling WAL files and + zero-filling new WAL files are disabled). + This setting is only appropriate for filesystems which + allocate new disk blocks on every write. + </para> + </listitem> + </varlistentry> + </variablelist> </sect2> <sect2 id="runtime-config-wal-archiving"> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ecd12fc53ae..1acce1c70d6 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -94,6 +94,7 @@ bool wal_log_hints = false; bool wal_compression = false; char *wal_consistency_checking_string = NULL; bool *wal_consistency_checking = NULL; +bool wal_cow_filesystem = false; bool log_checkpoints = false; int sync_method = DEFAULT_SYNC_METHOD; int wal_level = WAL_LEVEL_MINIMAL; @@ -3216,6 +3217,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock) XLogSegNo max_segno; int fd; int nbytes; + bool fail; XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size); @@ -3255,41 +3257,65 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock) (errcode_for_file_access(), errmsg("could not create file \"%s\": %m", tmppath))); - /* - * Zero-fill the file. We have to do this the hard way to ensure that all - * the file space has really been allocated --- on platforms that allow - * "holes" in files, just seeking to the end doesn't allocate intermediate - * space. This way, we know that we have all the space and (after the - * fsync below) that all the indirect blocks are down on disk. Therefore, - * fdatasync(2) or O_DSYNC will be sufficient to sync future writes to the - * log file. - */ memset(zbuffer.data, 0, XLOG_BLCKSZ); - for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ) + + /* + * Ensure the filesystem has physically allocated disk space for the data. + */ + if (!wal_cow_filesystem) { + /* + * In non-CoW filesystems, zero-fill the file. We have to do this the + * hard way to ensure that all the file space has really been + * allocated --- on platforms that allow "holes" in files, just seeking + * to the end doesn't allocate intermediate space. This way, we know + * that we have all the space and (after the fsync below) that all the + * indirect blocks are down on disk. Therefore, fdatasync(2) or + * O_DSYNC will be sufficient to sync future writes to the log file. + */ + fail = false; /* keep compiler quiet */ + for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ) + { + errno = 0; + pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); + fail = (int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ; + pgstat_report_wait_end(); + if (fail) + break; + } + } + else + { + /* + * In CoW filesystems, seeking to the end and writing a solitary byte + * is enough. + */ errno = 0; pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE); - if ((int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ) - { - int save_errno = errno; - /* - * If we fail to make the file, delete it to release disk space - */ - unlink(tmppath); + fail = lseek(fd, (off_t) (wal_segment_size - 1), SEEK_SET) < (off_t) 0 || + (int) write(fd, zbuffer.data, 1) != (int) 1; - close(fd); - - /* if write didn't set errno, assume problem is no disk space */ - errno = save_errno ? save_errno : ENOSPC; - - ereport(ERROR, - (errcode_for_file_access(), - errmsg("could not write to file \"%s\": %m", tmppath))); - } pgstat_report_wait_end(); } + if (fail) + { + int save_errno = errno; + + /* If we fail to make the file, delete it to release disk space */ + unlink(tmppath); + + close(fd); + + /* if write didn't set errno, assume problem is no disk space */ + errno = save_errno ? save_errno : ENOSPC; + + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write to file \"%s\": %m", tmppath))); + } + pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC); if (pg_fsync(fd) != 0) { @@ -4053,14 +4079,17 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr) XLogSegNo endlogSegNo; XLogSegNo recycleSegNo; - /* - * Initialize info about where to try to recycle to. - */ - XLByteToSeg(endptr, endlogSegNo, wal_segment_size); - if (RedoRecPtr == InvalidXLogRecPtr) - recycleSegNo = endlogSegNo + 10; + /* Initialize info about where to try to recycle to, if needed. */ + if (!wal_cow_filesystem) + { + XLByteToSeg(endptr, endlogSegNo, wal_segment_size); + if (RedoRecPtr == InvalidXLogRecPtr) + recycleSegNo = endlogSegNo + 10; + else + recycleSegNo = XLOGfileslop(RedoRecPtr); + } else - recycleSegNo = XLOGfileslop(RedoRecPtr); + recycleSegNo = (XLogSegNo) 0; /* keep compiler quiet */ snprintf(path, MAXPGPATH, XLOGDIR "/%s", segname); @@ -4068,8 +4097,12 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr) * Before deleting the file, see if it can be recycled as a future log * segment. Only recycle normal files, pg_standby for example can create * symbolic links pointing to a separate archive directory. + * + * Skip recycling on COW filesystems, though. It's better to create + * new files each time. */ - if (endlogSegNo <= recycleSegNo && + if (!wal_cow_filesystem && + endlogSegNo <= recycleSegNo && lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) && InstallXLogFileSegment(&endlogSegNo, path, true, recycleSegNo, true)) diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 156d147c85b..cbcf475885e 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -1177,6 +1177,19 @@ static struct config_bool ConfigureNamesBool[] = NULL, NULL, NULL }, + { + {"wal_cow_filesystem", PGC_SUSET, WAL_SETTINGS, + gettext_noop("WAL is stored on Copy-On-Write file system."), + gettext_noop("This option adjusts behavior to take advantage of " + "filesystem characteristics specific to CoW filesystems, " + "improving performance. " + "It should be enabled on ZFS and other similar filesystems.") + }, + &wal_cow_filesystem, + false, + NULL, NULL, NULL + }, + { {"log_checkpoints", PGC_SIGHUP, LOGGING_WHAT, gettext_noop("Logs each checkpoint."), diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 194f3120964..c9254c04bab 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -206,6 +206,7 @@ #wal_compression = off # enable compression of full-page writes #wal_log_hints = off # also do full page writes of non-critical updates # (change requires restart) +#wal_cow_filesystem = off # is pg_wal on a Copy-on-Write filesystem? #wal_buffers = -1 # min 32kB, -1 sets based on shared_buffers # (change requires restart) #wal_writer_delay = 200ms # 1-10000 milliseconds diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index f90a6a91391..fa9d69affe4 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -116,6 +116,7 @@ extern bool EnableHotStandby; extern bool fullPageWrites; extern bool wal_log_hints; extern bool wal_compression; +extern bool wal_cow_filesystem; extern bool *wal_consistency_checking; extern char *wal_consistency_checking_string; extern bool log_checkpoints; -- 2.17.1