On Mon, Dec 21, 2020 at 11:40 AM Andres Freund <and...@anarazel.de> wrote:
> On 2020-12-20 16:26:42 +1300, Thomas Munro wrote:
> > > 1. port.h cannot assume that <limits.h> has already been included;
> > > nor do I want to fix that by including <limits.h> there.  Do we
> > > really need to define a fallback value of IOV_MAX?  If so,
> > > maybe the answer is to put the replacement struct iovec and
> > > IOV_MAX in some new header.
> >
> > Ok, I moved all this stuff into port/pg_uio.h.
>
> Can we come up with a better name than 'uio'? I find that a not exactly
> meaningful name.

Ok, let's try port/pg_iovec.h.

> Or perhaps we could just leave the functions in port/port.h, but extract
> the value of the define at configure time? That way only pread.c /
> pwrite.c would need it.

That won't work for the struct definition, so client code would need
to remember to do:

#ifdef HAVE_SYS_UIO_H
#include <sys/uio.h>
#endif

... which is a little tedious, or port.h would need to do that and
incur an overhead in every translation unit, which Tom objected to.
That's why I liked the separate header idea.

> > > 3. The patch as given won't prove anything except that the code
> > > compiles.  Is it worth fixing at least one code path to make
> > > use of pg_preadv and pg_pwritev, so we can make sure this code
> > > is tested before there's layers of other new code on top?
> >
> > OK, here's a patch to zero-fill fresh WAL segments with pwritev().
>
> I think that's a good idea. However, I see two small issues: 1) If we
> write larger amounts at once, we need to handle partial writes. That's a
> large enough amount of IO that it's much more likely to hit a memory
> shortage or such in the kernel - we had to do that a while a go for WAL
> writes (which can also be larger), if memory serves.
>
> Perhaps we should have pg_pwritev/readv unconditionally go through
> pwrite.c/pread.c and add support for "continuing" partial writes/reads
> in one central place?

Ok, here's a new version with the following changes:

1.  Define PG_IOV_MAX, a reasonable size for local variables, not
larger than IOV_MAX.
2   Use 32 rather than 1024, based on off-list complaint about 1024
potentially swamping the IO system unfairly.
3.  Add a wrapper pg_pwritev_retry() to retry automatically on short
writes.  (I didn't write pg_preadv_retry(), because I don't currently
need it for anything so I didn't want to have to think about EOF vs
short-reads-for-implementation-reasons.)
4.  I considered whether pg_pwrite() should have built-in retry
instead of a separate wrapper, but I thought of an argument against
hiding the "raw" version: the AIO patch set already understands short
reads/writes and knows how to retry at a higher level, as that's
needed for native AIO too, so I think it makes sense to be able to
keep things the same and not solve the same problem twice.  A counter
argument would be that you could get the retry underway sooner with a
tight loop, but I'm not expecting this to be common.

> > I'm drawing a blank on trivial candidate uses for preadv(), without
> > infrastructure from later patches.
>
> Can't immediately think of something either.

One idea I had for the future is for xlogreader.c to read the WAL into
a large multi-page circular buffer, which could wrap around using a
pair of iovecs, but that requires lots more work    .
From 3da0b25b4ce0804355f912bd026ef9c9eee146f3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 26 Nov 2020 15:48:31 +1300
Subject: [PATCH v3 1/2] Add pg_preadv() and pg_pwritev().

Provide synchronous scatter/gather I/O routines.  These map to preadv(),
pwritev() with various fallbacks for systems that don't have them.  Also
provide a wrapper pg_pwritev_retry() which automatically retries on
short write.

Reviewed-by: Tom Lane <t...@sss.pgh.pa.us>
Reviewed-by: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com
---
 configure                   |  30 +---------
 configure.ac                |   9 ++-
 src/include/pg_config.h.in  |  15 +++++
 src/include/port.h          |   2 +
 src/include/port/pg_iovec.h |  57 +++++++++++++++++++
 src/port/Makefile           |   2 +
 src/port/pread.c            |  43 +++++++++++++-
 src/port/pwrite.c           | 109 +++++++++++++++++++++++++++++++++++-
 src/tools/msvc/Solution.pm  |   5 ++
 9 files changed, 238 insertions(+), 34 deletions(-)
 create mode 100644 src/include/port/pg_iovec.h

diff --git a/configure b/configure
index 11a4284e5b..5887c712cc 100755
--- a/configure
+++ b/configure
@@ -13061,7 +13061,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h
 fi
 
 
-for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h wctype.h
+for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/uio.h sys/un.h termios.h ucred.h wctype.h
 do :
   as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh`
 ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default"
@@ -15155,7 +15155,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l
+for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pread preadv pstat pthread_is_threaded_np pwrite pwritev readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
@@ -15832,32 +15832,6 @@ esac
 
 fi
 
-ac_fn_c_check_func "$LINENO" "pread" "ac_cv_func_pread"
-if test "x$ac_cv_func_pread" = xyes; then :
-  $as_echo "#define HAVE_PREAD 1" >>confdefs.h
-
-else
-  case " $LIBOBJS " in
-  *" pread.$ac_objext "* ) ;;
-  *) LIBOBJS="$LIBOBJS pread.$ac_objext"
- ;;
-esac
-
-fi
-
-ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite"
-if test "x$ac_cv_func_pwrite" = xyes; then :
-  $as_echo "#define HAVE_PWRITE 1" >>confdefs.h
-
-else
-  case " $LIBOBJS " in
-  *" pwrite.$ac_objext "* ) ;;
-  *) LIBOBJS="$LIBOBJS pwrite.$ac_objext"
- ;;
-esac
-
-fi
-
 ac_fn_c_check_func "$LINENO" "random" "ac_cv_func_random"
 if test "x$ac_cv_func_random" = xyes; then :
   $as_echo "#define HAVE_RANDOM 1" >>confdefs.h
diff --git a/configure.ac b/configure.ac
index fc523c6aeb..c930463225 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1331,6 +1331,7 @@ AC_CHECK_HEADERS(m4_normalize([
 	sys/shm.h
 	sys/sockio.h
 	sys/tas.h
+	sys/uio.h
 	sys/un.h
 	termios.h
 	ucred.h
@@ -1660,9 +1661,14 @@ AC_CHECK_FUNCS(m4_normalize([
 	poll
 	posix_fallocate
 	ppoll
+	pread
+	preadv
 	pstat
 	pthread_is_threaded_np
+	pwrite
+	pwritev
 	readlink
+	readv
 	setproctitle
 	setproctitle_fast
 	setsid
@@ -1673,6 +1679,7 @@ AC_CHECK_FUNCS(m4_normalize([
 	sync_file_range
 	uselocale
 	wcstombs_l
+	writev
 ]))
 
 # These typically are compiler builtins, for which AC_CHECK_FUNCS fails.
@@ -1733,8 +1740,6 @@ AC_REPLACE_FUNCS(m4_normalize([
 	inet_aton
 	link
 	mkdtemp
-	pread
-	pwrite
 	random
 	srandom
 	strlcat
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index de8f838e53..582cbfa012 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -412,6 +412,9 @@
 /* Define to 1 if you have the `pread' function. */
 #undef HAVE_PREAD
 
+/* Define to 1 if you have the `preadv' function. */
+#undef HAVE_PREADV
+
 /* Define to 1 if you have the `pstat' function. */
 #undef HAVE_PSTAT
 
@@ -430,6 +433,9 @@
 /* Define to 1 if you have the `pwrite' function. */
 #undef HAVE_PWRITE
 
+/* Define to 1 if you have the `pwritev' function. */
+#undef HAVE_PWRITEV
+
 /* Define to 1 if you have the `random' function. */
 #undef HAVE_RANDOM
 
@@ -445,6 +451,9 @@
 /* Define to 1 if you have the `readlink' function. */
 #undef HAVE_READLINK
 
+/* Define to 1 if you have the `readv' function. */
+#undef HAVE_READV
+
 /* Define to 1 if you have the global variable
    'rl_completion_append_character'. */
 #undef HAVE_RL_COMPLETION_APPEND_CHARACTER
@@ -626,6 +635,9 @@
 /* Define to 1 if you have the <sys/ucred.h> header file. */
 #undef HAVE_SYS_UCRED_H
 
+/* Define to 1 if you have the <sys/uio.h> header file. */
+#undef HAVE_SYS_UIO_H
+
 /* Define to 1 if you have the <sys/un.h> header file. */
 #undef HAVE_SYS_UN_H
 
@@ -680,6 +692,9 @@
 /* Define to 1 if you have the <winldap.h> header file. */
 #undef HAVE_WINLDAP_H
 
+/* Define to 1 if you have the `writev' function. */
+#undef HAVE_WRITEV
+
 /* Define to 1 if you have the `X509_get_signature_nid' function. */
 #undef HAVE_X509_GET_SIGNATURE_NID
 
diff --git a/src/include/port.h b/src/include/port.h
index 5dfb00b07c..87b20518c3 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -431,6 +431,8 @@ extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset);
 extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset);
 #endif
 
+/* For pg_pwritev() and pg_preadv(), see port/pg_uio.h. */
+
 #if !HAVE_DECL_STRLCAT
 extern size_t strlcat(char *dst, const char *src, size_t siz);
 #endif
diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h
new file mode 100644
index 0000000000..035d54f840
--- /dev/null
+++ b/src/include/port/pg_iovec.h
@@ -0,0 +1,57 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_iovec.h
+ *	  Header for the vectored I/O functions in src/port/p{read,write}.c.
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/pg_iovec.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_IOVEC_H
+#define PG_IOVEC_H
+
+#include <limits.h>
+
+#ifdef HAVE_SYS_UIO_H
+#include <sys/uio.h>
+#endif
+
+/* If <sys/uio.h> is missing, define our own POSIX-compatible iovec struct. */
+#ifndef HAVE_SYS_UIO_H
+struct iovec
+{
+	void	   *iov_base;
+	size_t		iov_len;
+};
+#endif
+
+/*
+ * If <limits.h> didn't define IOV_MAX, define our own.  POSIX requires at
+ * least 16.
+ */
+#ifndef IOV_MAX
+#define IOV_MAX 16
+#endif
+
+/* Define a reasonable maximum that is safe to use on the stack. */
+#define PG_IOV_MAX Min(IOV_MAX, 32)
+
+#ifdef HAVE_PREADV
+#define pg_preadv preadv
+#else
+extern ssize_t pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset);
+#endif
+
+#ifdef HAVE_PWRITEV
+#define pg_pwritev pwritev
+#else
+extern ssize_t pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset);
+#endif
+
+extern ssize_t pg_pwritev_retry(int fd, const struct iovec *iov, int iovcnt,
+								off_t offset);
+
+#endif							/* PG_IOVEC_H */
diff --git a/src/port/Makefile b/src/port/Makefile
index e41b005c4f..bc4923ce84 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -53,6 +53,8 @@ OBJS = \
 	pgstrcasecmp.o \
 	pgstrsignal.o \
 	pqsignal.o \
+	pread.o \
+	pwrite.o \
 	qsort.o \
 	qsort_arg.o \
 	quotes.o \
diff --git a/src/port/pread.c b/src/port/pread.c
index e7fecebe5f..fe0c44434d 100644
--- a/src/port/pread.c
+++ b/src/port/pread.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pread.c
- *	  Implementation of pread(2) for platforms that lack one.
+ *	  Implementation of pread[v](2) for platforms that lack one.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  *
@@ -9,7 +9,8 @@
  *	  src/port/pread.c
  *
  * Note that this implementation changes the current file position, unlike
- * the POSIX function, so we use the name pg_pread().
+ * the POSIX function, so we use the name pg_pread().  Likewise for the
+ * iovec version.
  *
  *-------------------------------------------------------------------------
  */
@@ -23,6 +24,9 @@
 #include <unistd.h>
 #endif
 
+#include "port/pg_iovec.h"
+
+#ifndef HAVE_PREAD
 ssize_t
 pg_pread(int fd, void *buf, size_t size, off_t offset)
 {
@@ -56,3 +60,38 @@ pg_pread(int fd, void *buf, size_t size, off_t offset)
 	return read(fd, buf, size);
 #endif
 }
+#endif
+
+#ifndef HAVE_PREADV
+ssize_t
+pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+{
+#ifdef HAVE_READV
+	if (iovcnt == 1)
+		return pg_pread(fd, iov[0].iov_base, iov[0].iov_len, offset);
+	if (lseek(fd, offset, SEEK_SET) < 0)
+		return -1;
+	return readv(fd, iov, iovcnt);
+#else
+	ssize_t 	sum = 0;
+	ssize_t 	part;
+
+	for (int i = 0; i < iovcnt; ++i)
+	{
+		part = pg_pread(fd, iov[i].iov_base, iov[i].iov_len, offset);
+		if (part < 0)
+		{
+			if (i == 0)
+				return -1;
+			else
+				return sum;
+		}
+		sum += part;
+		offset += part;
+		if (part < iov[i].iov_len)
+			return sum;
+	}
+	return sum;
+#endif
+}
+#endif
diff --git a/src/port/pwrite.c b/src/port/pwrite.c
index 2e0c154462..d7ecbf362c 100644
--- a/src/port/pwrite.c
+++ b/src/port/pwrite.c
@@ -1,7 +1,7 @@
 /*-------------------------------------------------------------------------
  *
  * pwrite.c
- *	  Implementation of pwrite(2) for platforms that lack one.
+ *	  Implementation of pwrite[v](2) for platforms that lack one.
  *
  * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
  *
@@ -9,7 +9,8 @@
  *	  src/port/pwrite.c
  *
  * Note that this implementation changes the current file position, unlike
- * the POSIX function, so we use the name pg_pwrite().
+ * the POSIX function, so we use the name pg_pwrite().  Likewise for the
+ * iovec version.
  *
  *-------------------------------------------------------------------------
  */
@@ -23,6 +24,9 @@
 #include <unistd.h>
 #endif
 
+#include "port/pg_iovec.h"
+
+#ifndef HAVE_PWRITE
 ssize_t
 pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
 {
@@ -53,3 +57,104 @@ pg_pwrite(int fd, const void *buf, size_t size, off_t offset)
 	return write(fd, buf, size);
 #endif
 }
+#endif
+
+#ifndef HAVE_PWRITEV
+ssize_t
+pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+{
+#ifdef HAVE_WRITEV
+	if (iovcnt == 1)
+		return pg_pwrite(fd, iov[0].iov_base, iov[0].iov_len, offset);
+	if (lseek(fd, offset, SEEK_SET) < 0)
+		return -1;
+	return writev(fd, iov, iovcnt);
+#else
+	ssize_t 	sum = 0;
+	ssize_t 	part;
+
+	for (int i = 0; i < iovcnt; ++i)
+	{
+		part = pg_pwrite(fd, iov[i].iov_base, iov[i].iov_len, offset);
+		if (part < 0)
+		{
+			if (i == 0)
+				return -1;
+			else
+				return sum;
+		}
+		sum += part;
+		offset += part;
+		if (part < iov[i].iov_len)
+			return sum;
+	}
+	return sum;
+#endif
+}
+#endif
+
+/*
+ * A wrapper for pg_pwritev() that retries on partial write.
+ */
+ssize_t
+pg_pwritev_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
+{
+	struct iovec iov_copy[PG_IOV_MAX];
+	ssize_t		goal = 0;
+	ssize_t		sum = 0;
+	ssize_t		part;
+
+	/* We'd better have space to make a copy, in case we need to retry. */
+	if (iovcnt > PG_IOV_MAX)
+	{
+		errno = EINVAL;
+		return -1;
+	}
+
+	/* How much are we trying to write? */
+	for (int i = 0; i < iovcnt; ++i)
+		goal += iov[i].iov_len;
+
+	for (;;)
+	{
+		/* Write as much as we can. */
+		part = pg_pwritev(fd, iov, iovcnt, offset);
+		if (part < 0)
+			return -1;
+
+#ifdef SIMULATE_SHORT_WRITE
+		part = Min(part, 4096);
+#endif
+
+		/* Entirely done yet? */
+		sum += part;
+		if (sum == goal)
+			break;
+
+		/* Step over the part of the file that is done. */
+		Assert(sum < goal);
+		Assert(iovcnt > 0);
+		offset += part;
+
+		/* Step over iovecs that are done. */
+		while (iov->iov_len <= part)
+		{
+			part -= iov->iov_len;
+			++iov;
+			--iovcnt;
+			Assert(iovcnt > 0);
+		}
+
+		/* We need a temporary copy to scribble on. */
+		memmove(iov_copy, iov, sizeof(*iov) * iovcnt);
+		iov = iov_copy;
+
+		/* The first remaining iovec might be partially done.  Adjust it. */
+		Assert(iovcnt > 0);
+		Assert(iov_copy[0].iov_len > part);
+		iov_copy[0].iov_base = (char *) iov_copy[0].iov_base + part;
+		iov_copy[0].iov_len -= part;
+	}
+
+	return sum;
+}
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 22d6abd367..832a363282 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -329,17 +329,20 @@ sub GenerateFiles
 		HAVE_PPC_LWARX_MUTEX_HINT   => undef,
 		HAVE_PPOLL                  => undef,
 		HAVE_PREAD                  => undef,
+		HAVE_PREADV                 => undef,
 		HAVE_PSTAT                  => undef,
 		HAVE_PS_STRINGS             => undef,
 		HAVE_PTHREAD                => undef,
 		HAVE_PTHREAD_IS_THREADED_NP => undef,
 		HAVE_PTHREAD_PRIO_INHERIT   => undef,
 		HAVE_PWRITE                 => undef,
+		HAVE_PWRITEV                => undef,
 		HAVE_RANDOM                 => undef,
 		HAVE_READLINE_H             => undef,
 		HAVE_READLINE_HISTORY_H     => undef,
 		HAVE_READLINE_READLINE_H    => undef,
 		HAVE_READLINK               => undef,
+		HAVE_READV                  => undef,
 		HAVE_RL_COMPLETION_APPEND_CHARACTER      => undef,
 		HAVE_RL_COMPLETION_MATCHES               => undef,
 		HAVE_RL_COMPLETION_SUPPRESS_QUOTE        => undef,
@@ -400,6 +403,7 @@ sub GenerateFiles
 		HAVE_SYS_TYPES_H                         => 1,
 		HAVE_SYS_UCRED_H                         => undef,
 		HAVE_SYS_UN_H                            => undef,
+		HAVE_SYS_UIO_H                           => undef,
 		HAVE_TERMIOS_H                           => undef,
 		HAVE_TYPEOF                              => undef,
 		HAVE_UCRED_H                             => undef,
@@ -417,6 +421,7 @@ sub GenerateFiles
 		HAVE_WINLDAP_H                           => undef,
 		HAVE_WCSTOMBS_L                          => 1,
 		HAVE_WCTYPE_H                            => 1,
+		HAVE_WRITEV                              => undef,
 		HAVE_X509_GET_SIGNATURE_NID              => 1,
 		HAVE_X86_64_POPCNTQ                      => undef,
 		HAVE__BOOL                               => undef,
-- 
2.20.1

From 01e2b824a2092d2bcbf5e53178ba08ef0abfe354 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Sun, 20 Dec 2020 13:16:21 +1300
Subject: [PATCH v3 2/2] Use vectored I/O to zero WAL segments.

Instead of making many block-sized write() calls to fill a 16MB WAL file
with zeroes, make a smaller number of pwritev() calls if available.  The
actual number depends on the OS's IOV_MAX, but we cap it at 32 so we
write 256kB per call on typical systems.

Reviewed-by: Tom Lane <t...@sss.pgh.pa.us>
Reviewed-by: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b1e5d2dbff..d7b3d6d5bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -48,6 +48,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
@@ -3270,7 +3271,6 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	XLogSegNo	installed_segno;
 	XLogSegNo	max_segno;
 	int			fd;
-	int			nbytes;
 	int			save_errno;
 
 	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
@@ -3317,6 +3317,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	save_errno = 0;
 	if (wal_init_zero)
 	{
+		struct iovec iov[PG_IOV_MAX];
+		int			blocks;
+
 		/*
 		 * Zero-fill the file.  With this setting, we do this the hard way to
 		 * ensure that all the file space has really been allocated.  On
@@ -3326,15 +3329,28 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
-		for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
+
+		/* Prepare to write out a lot of copies of our zero buffer at once. */
+		for (int i = 0; i < lengthof(iov); ++i)
 		{
-			errno = 0;
-			if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+			iov[i].iov_base = zbuffer.data;
+			iov[i].iov_len = XLOG_BLCKSZ;
+		}
+
+		/* Loop, writing as many blocks as we can for each system call. */
+		blocks = wal_segment_size / XLOG_BLCKSZ;
+		for (int i = 0; i < blocks;)
+		{
+			int 		iovcnt = Min(blocks - i, lengthof(iov));
+			off_t		offset = i * XLOG_BLCKSZ;
+
+			if (pg_pwritev_retry(fd, iov, iovcnt, offset) < 0)
 			{
-				/* if write didn't set errno, assume no disk space */
-				save_errno = errno ? errno : ENOSPC;
+				save_errno = errno;
 				break;
 			}
+
+			i += iovcnt;
 		}
 	}
 	else
-- 
2.20.1

Reply via email to