On Mon, Dec 21, 2020 at 11:40 AM Andres Freund <and...@anarazel.de> wrote: > On 2020-12-20 16:26:42 +1300, Thomas Munro wrote: > > > 1. port.h cannot assume that <limits.h> has already been included; > > > nor do I want to fix that by including <limits.h> there. Do we > > > really need to define a fallback value of IOV_MAX? If so, > > > maybe the answer is to put the replacement struct iovec and > > > IOV_MAX in some new header. > > > > Ok, I moved all this stuff into port/pg_uio.h. > > Can we come up with a better name than 'uio'? I find that a not exactly > meaningful name.
Ok, let's try port/pg_iovec.h. > Or perhaps we could just leave the functions in port/port.h, but extract > the value of the define at configure time? That way only pread.c / > pwrite.c would need it. That won't work for the struct definition, so client code would need to remember to do: #ifdef HAVE_SYS_UIO_H #include <sys/uio.h> #endif ... which is a little tedious, or port.h would need to do that and incur an overhead in every translation unit, which Tom objected to. That's why I liked the separate header idea. > > > 3. The patch as given won't prove anything except that the code > > > compiles. Is it worth fixing at least one code path to make > > > use of pg_preadv and pg_pwritev, so we can make sure this code > > > is tested before there's layers of other new code on top? > > > > OK, here's a patch to zero-fill fresh WAL segments with pwritev(). > > I think that's a good idea. However, I see two small issues: 1) If we > write larger amounts at once, we need to handle partial writes. That's a > large enough amount of IO that it's much more likely to hit a memory > shortage or such in the kernel - we had to do that a while a go for WAL > writes (which can also be larger), if memory serves. > > Perhaps we should have pg_pwritev/readv unconditionally go through > pwrite.c/pread.c and add support for "continuing" partial writes/reads > in one central place? Ok, here's a new version with the following changes: 1. Define PG_IOV_MAX, a reasonable size for local variables, not larger than IOV_MAX. 2 Use 32 rather than 1024, based on off-list complaint about 1024 potentially swamping the IO system unfairly. 3. Add a wrapper pg_pwritev_retry() to retry automatically on short writes. (I didn't write pg_preadv_retry(), because I don't currently need it for anything so I didn't want to have to think about EOF vs short-reads-for-implementation-reasons.) 4. I considered whether pg_pwrite() should have built-in retry instead of a separate wrapper, but I thought of an argument against hiding the "raw" version: the AIO patch set already understands short reads/writes and knows how to retry at a higher level, as that's needed for native AIO too, so I think it makes sense to be able to keep things the same and not solve the same problem twice. A counter argument would be that you could get the retry underway sooner with a tight loop, but I'm not expecting this to be common. > > I'm drawing a blank on trivial candidate uses for preadv(), without > > infrastructure from later patches. > > Can't immediately think of something either. One idea I had for the future is for xlogreader.c to read the WAL into a large multi-page circular buffer, which could wrap around using a pair of iovecs, but that requires lots more work .
From 3da0b25b4ce0804355f912bd026ef9c9eee146f3 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Thu, 26 Nov 2020 15:48:31 +1300 Subject: [PATCH v3 1/2] Add pg_preadv() and pg_pwritev(). Provide synchronous scatter/gather I/O routines. These map to preadv(), pwritev() with various fallbacks for systems that don't have them. Also provide a wrapper pg_pwritev_retry() which automatically retries on short write. Reviewed-by: Tom Lane <t...@sss.pgh.pa.us> Reviewed-by: Andres Freund <and...@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com --- configure | 30 +--------- configure.ac | 9 ++- src/include/pg_config.h.in | 15 +++++ src/include/port.h | 2 + src/include/port/pg_iovec.h | 57 +++++++++++++++++++ src/port/Makefile | 2 + src/port/pread.c | 43 +++++++++++++- src/port/pwrite.c | 109 +++++++++++++++++++++++++++++++++++- src/tools/msvc/Solution.pm | 5 ++ 9 files changed, 238 insertions(+), 34 deletions(-) create mode 100644 src/include/port/pg_iovec.h diff --git a/configure b/configure index 11a4284e5b..5887c712cc 100755 --- a/configure +++ b/configure @@ -13061,7 +13061,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h fi -for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h wctype.h +for ac_header in atomic.h copyfile.h execinfo.h getopt.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/event.h sys/ipc.h sys/prctl.h sys/procctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/uio.h sys/un.h termios.h ucred.h wctype.h do : as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh` ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default" @@ -15155,7 +15155,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l +for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pread preadv pstat pthread_is_threaded_np pwrite pwritev readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" @@ -15832,32 +15832,6 @@ esac fi -ac_fn_c_check_func "$LINENO" "pread" "ac_cv_func_pread" -if test "x$ac_cv_func_pread" = xyes; then : - $as_echo "#define HAVE_PREAD 1" >>confdefs.h - -else - case " $LIBOBJS " in - *" pread.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS pread.$ac_objext" - ;; -esac - -fi - -ac_fn_c_check_func "$LINENO" "pwrite" "ac_cv_func_pwrite" -if test "x$ac_cv_func_pwrite" = xyes; then : - $as_echo "#define HAVE_PWRITE 1" >>confdefs.h - -else - case " $LIBOBJS " in - *" pwrite.$ac_objext "* ) ;; - *) LIBOBJS="$LIBOBJS pwrite.$ac_objext" - ;; -esac - -fi - ac_fn_c_check_func "$LINENO" "random" "ac_cv_func_random" if test "x$ac_cv_func_random" = xyes; then : $as_echo "#define HAVE_RANDOM 1" >>confdefs.h diff --git a/configure.ac b/configure.ac index fc523c6aeb..c930463225 100644 --- a/configure.ac +++ b/configure.ac @@ -1331,6 +1331,7 @@ AC_CHECK_HEADERS(m4_normalize([ sys/shm.h sys/sockio.h sys/tas.h + sys/uio.h sys/un.h termios.h ucred.h @@ -1660,9 +1661,14 @@ AC_CHECK_FUNCS(m4_normalize([ poll posix_fallocate ppoll + pread + preadv pstat pthread_is_threaded_np + pwrite + pwritev readlink + readv setproctitle setproctitle_fast setsid @@ -1673,6 +1679,7 @@ AC_CHECK_FUNCS(m4_normalize([ sync_file_range uselocale wcstombs_l + writev ])) # These typically are compiler builtins, for which AC_CHECK_FUNCS fails. @@ -1733,8 +1740,6 @@ AC_REPLACE_FUNCS(m4_normalize([ inet_aton link mkdtemp - pread - pwrite random srandom strlcat diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index de8f838e53..582cbfa012 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -412,6 +412,9 @@ /* Define to 1 if you have the `pread' function. */ #undef HAVE_PREAD +/* Define to 1 if you have the `preadv' function. */ +#undef HAVE_PREADV + /* Define to 1 if you have the `pstat' function. */ #undef HAVE_PSTAT @@ -430,6 +433,9 @@ /* Define to 1 if you have the `pwrite' function. */ #undef HAVE_PWRITE +/* Define to 1 if you have the `pwritev' function. */ +#undef HAVE_PWRITEV + /* Define to 1 if you have the `random' function. */ #undef HAVE_RANDOM @@ -445,6 +451,9 @@ /* Define to 1 if you have the `readlink' function. */ #undef HAVE_READLINK +/* Define to 1 if you have the `readv' function. */ +#undef HAVE_READV + /* Define to 1 if you have the global variable 'rl_completion_append_character'. */ #undef HAVE_RL_COMPLETION_APPEND_CHARACTER @@ -626,6 +635,9 @@ /* Define to 1 if you have the <sys/ucred.h> header file. */ #undef HAVE_SYS_UCRED_H +/* Define to 1 if you have the <sys/uio.h> header file. */ +#undef HAVE_SYS_UIO_H + /* Define to 1 if you have the <sys/un.h> header file. */ #undef HAVE_SYS_UN_H @@ -680,6 +692,9 @@ /* Define to 1 if you have the <winldap.h> header file. */ #undef HAVE_WINLDAP_H +/* Define to 1 if you have the `writev' function. */ +#undef HAVE_WRITEV + /* Define to 1 if you have the `X509_get_signature_nid' function. */ #undef HAVE_X509_GET_SIGNATURE_NID diff --git a/src/include/port.h b/src/include/port.h index 5dfb00b07c..87b20518c3 100644 --- a/src/include/port.h +++ b/src/include/port.h @@ -431,6 +431,8 @@ extern ssize_t pg_pread(int fd, void *buf, size_t nbyte, off_t offset); extern ssize_t pg_pwrite(int fd, const void *buf, size_t nbyte, off_t offset); #endif +/* For pg_pwritev() and pg_preadv(), see port/pg_uio.h. */ + #if !HAVE_DECL_STRLCAT extern size_t strlcat(char *dst, const char *src, size_t siz); #endif diff --git a/src/include/port/pg_iovec.h b/src/include/port/pg_iovec.h new file mode 100644 index 0000000000..035d54f840 --- /dev/null +++ b/src/include/port/pg_iovec.h @@ -0,0 +1,57 @@ +/*------------------------------------------------------------------------- + * + * pg_iovec.h + * Header for the vectored I/O functions in src/port/p{read,write}.c. + * + * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/port/pg_iovec.h + * + *------------------------------------------------------------------------- + */ +#ifndef PG_IOVEC_H +#define PG_IOVEC_H + +#include <limits.h> + +#ifdef HAVE_SYS_UIO_H +#include <sys/uio.h> +#endif + +/* If <sys/uio.h> is missing, define our own POSIX-compatible iovec struct. */ +#ifndef HAVE_SYS_UIO_H +struct iovec +{ + void *iov_base; + size_t iov_len; +}; +#endif + +/* + * If <limits.h> didn't define IOV_MAX, define our own. POSIX requires at + * least 16. + */ +#ifndef IOV_MAX +#define IOV_MAX 16 +#endif + +/* Define a reasonable maximum that is safe to use on the stack. */ +#define PG_IOV_MAX Min(IOV_MAX, 32) + +#ifdef HAVE_PREADV +#define pg_preadv preadv +#else +extern ssize_t pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset); +#endif + +#ifdef HAVE_PWRITEV +#define pg_pwritev pwritev +#else +extern ssize_t pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset); +#endif + +extern ssize_t pg_pwritev_retry(int fd, const struct iovec *iov, int iovcnt, + off_t offset); + +#endif /* PG_IOVEC_H */ diff --git a/src/port/Makefile b/src/port/Makefile index e41b005c4f..bc4923ce84 100644 --- a/src/port/Makefile +++ b/src/port/Makefile @@ -53,6 +53,8 @@ OBJS = \ pgstrcasecmp.o \ pgstrsignal.o \ pqsignal.o \ + pread.o \ + pwrite.o \ qsort.o \ qsort_arg.o \ quotes.o \ diff --git a/src/port/pread.c b/src/port/pread.c index e7fecebe5f..fe0c44434d 100644 --- a/src/port/pread.c +++ b/src/port/pread.c @@ -1,7 +1,7 @@ /*------------------------------------------------------------------------- * * pread.c - * Implementation of pread(2) for platforms that lack one. + * Implementation of pread[v](2) for platforms that lack one. * * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group * @@ -9,7 +9,8 @@ * src/port/pread.c * * Note that this implementation changes the current file position, unlike - * the POSIX function, so we use the name pg_pread(). + * the POSIX function, so we use the name pg_pread(). Likewise for the + * iovec version. * *------------------------------------------------------------------------- */ @@ -23,6 +24,9 @@ #include <unistd.h> #endif +#include "port/pg_iovec.h" + +#ifndef HAVE_PREAD ssize_t pg_pread(int fd, void *buf, size_t size, off_t offset) { @@ -56,3 +60,38 @@ pg_pread(int fd, void *buf, size_t size, off_t offset) return read(fd, buf, size); #endif } +#endif + +#ifndef HAVE_PREADV +ssize_t +pg_preadv(int fd, const struct iovec *iov, int iovcnt, off_t offset) +{ +#ifdef HAVE_READV + if (iovcnt == 1) + return pg_pread(fd, iov[0].iov_base, iov[0].iov_len, offset); + if (lseek(fd, offset, SEEK_SET) < 0) + return -1; + return readv(fd, iov, iovcnt); +#else + ssize_t sum = 0; + ssize_t part; + + for (int i = 0; i < iovcnt; ++i) + { + part = pg_pread(fd, iov[i].iov_base, iov[i].iov_len, offset); + if (part < 0) + { + if (i == 0) + return -1; + else + return sum; + } + sum += part; + offset += part; + if (part < iov[i].iov_len) + return sum; + } + return sum; +#endif +} +#endif diff --git a/src/port/pwrite.c b/src/port/pwrite.c index 2e0c154462..d7ecbf362c 100644 --- a/src/port/pwrite.c +++ b/src/port/pwrite.c @@ -1,7 +1,7 @@ /*------------------------------------------------------------------------- * * pwrite.c - * Implementation of pwrite(2) for platforms that lack one. + * Implementation of pwrite[v](2) for platforms that lack one. * * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group * @@ -9,7 +9,8 @@ * src/port/pwrite.c * * Note that this implementation changes the current file position, unlike - * the POSIX function, so we use the name pg_pwrite(). + * the POSIX function, so we use the name pg_pwrite(). Likewise for the + * iovec version. * *------------------------------------------------------------------------- */ @@ -23,6 +24,9 @@ #include <unistd.h> #endif +#include "port/pg_iovec.h" + +#ifndef HAVE_PWRITE ssize_t pg_pwrite(int fd, const void *buf, size_t size, off_t offset) { @@ -53,3 +57,104 @@ pg_pwrite(int fd, const void *buf, size_t size, off_t offset) return write(fd, buf, size); #endif } +#endif + +#ifndef HAVE_PWRITEV +ssize_t +pg_pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset) +{ +#ifdef HAVE_WRITEV + if (iovcnt == 1) + return pg_pwrite(fd, iov[0].iov_base, iov[0].iov_len, offset); + if (lseek(fd, offset, SEEK_SET) < 0) + return -1; + return writev(fd, iov, iovcnt); +#else + ssize_t sum = 0; + ssize_t part; + + for (int i = 0; i < iovcnt; ++i) + { + part = pg_pwrite(fd, iov[i].iov_base, iov[i].iov_len, offset); + if (part < 0) + { + if (i == 0) + return -1; + else + return sum; + } + sum += part; + offset += part; + if (part < iov[i].iov_len) + return sum; + } + return sum; +#endif +} +#endif + +/* + * A wrapper for pg_pwritev() that retries on partial write. + */ +ssize_t +pg_pwritev_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset) +{ + struct iovec iov_copy[PG_IOV_MAX]; + ssize_t goal = 0; + ssize_t sum = 0; + ssize_t part; + + /* We'd better have space to make a copy, in case we need to retry. */ + if (iovcnt > PG_IOV_MAX) + { + errno = EINVAL; + return -1; + } + + /* How much are we trying to write? */ + for (int i = 0; i < iovcnt; ++i) + goal += iov[i].iov_len; + + for (;;) + { + /* Write as much as we can. */ + part = pg_pwritev(fd, iov, iovcnt, offset); + if (part < 0) + return -1; + +#ifdef SIMULATE_SHORT_WRITE + part = Min(part, 4096); +#endif + + /* Entirely done yet? */ + sum += part; + if (sum == goal) + break; + + /* Step over the part of the file that is done. */ + Assert(sum < goal); + Assert(iovcnt > 0); + offset += part; + + /* Step over iovecs that are done. */ + while (iov->iov_len <= part) + { + part -= iov->iov_len; + ++iov; + --iovcnt; + Assert(iovcnt > 0); + } + + /* We need a temporary copy to scribble on. */ + memmove(iov_copy, iov, sizeof(*iov) * iovcnt); + iov = iov_copy; + + /* The first remaining iovec might be partially done. Adjust it. */ + Assert(iovcnt > 0); + Assert(iov_copy[0].iov_len > part); + iov_copy[0].iov_base = (char *) iov_copy[0].iov_base + part; + iov_copy[0].iov_len -= part; + } + + return sum; +} diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm index 22d6abd367..832a363282 100644 --- a/src/tools/msvc/Solution.pm +++ b/src/tools/msvc/Solution.pm @@ -329,17 +329,20 @@ sub GenerateFiles HAVE_PPC_LWARX_MUTEX_HINT => undef, HAVE_PPOLL => undef, HAVE_PREAD => undef, + HAVE_PREADV => undef, HAVE_PSTAT => undef, HAVE_PS_STRINGS => undef, HAVE_PTHREAD => undef, HAVE_PTHREAD_IS_THREADED_NP => undef, HAVE_PTHREAD_PRIO_INHERIT => undef, HAVE_PWRITE => undef, + HAVE_PWRITEV => undef, HAVE_RANDOM => undef, HAVE_READLINE_H => undef, HAVE_READLINE_HISTORY_H => undef, HAVE_READLINE_READLINE_H => undef, HAVE_READLINK => undef, + HAVE_READV => undef, HAVE_RL_COMPLETION_APPEND_CHARACTER => undef, HAVE_RL_COMPLETION_MATCHES => undef, HAVE_RL_COMPLETION_SUPPRESS_QUOTE => undef, @@ -400,6 +403,7 @@ sub GenerateFiles HAVE_SYS_TYPES_H => 1, HAVE_SYS_UCRED_H => undef, HAVE_SYS_UN_H => undef, + HAVE_SYS_UIO_H => undef, HAVE_TERMIOS_H => undef, HAVE_TYPEOF => undef, HAVE_UCRED_H => undef, @@ -417,6 +421,7 @@ sub GenerateFiles HAVE_WINLDAP_H => undef, HAVE_WCSTOMBS_L => 1, HAVE_WCTYPE_H => 1, + HAVE_WRITEV => undef, HAVE_X509_GET_SIGNATURE_NID => 1, HAVE_X86_64_POPCNTQ => undef, HAVE__BOOL => undef, -- 2.20.1
From 01e2b824a2092d2bcbf5e53178ba08ef0abfe354 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.mu...@gmail.com> Date: Sun, 20 Dec 2020 13:16:21 +1300 Subject: [PATCH v3 2/2] Use vectored I/O to zero WAL segments. Instead of making many block-sized write() calls to fill a 16MB WAL file with zeroes, make a smaller number of pwritev() calls if available. The actual number depends on the OS's IOV_MAX, but we cap it at 32 so we write 256kB per call on typical systems. Reviewed-by: Tom Lane <t...@sss.pgh.pa.us> Reviewed-by: Andres Freund <and...@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGJA%2Bu-220VONeoREBXJ9P3S94Y7J%2BkqCnTYmahvZJwM%3Dg%40mail.gmail.com --- src/backend/access/transam/xlog.c | 28 ++++++++++++++++++++++------ 1 file changed, 22 insertions(+), 6 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index b1e5d2dbff..d7b3d6d5bd 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -48,6 +48,7 @@ #include "pg_trace.h" #include "pgstat.h" #include "port/atomics.h" +#include "port/pg_iovec.h" #include "postmaster/bgwriter.h" #include "postmaster/startup.h" #include "postmaster/walwriter.h" @@ -3270,7 +3271,6 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock) XLogSegNo installed_segno; XLogSegNo max_segno; int fd; - int nbytes; int save_errno; XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size); @@ -3317,6 +3317,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock) save_errno = 0; if (wal_init_zero) { + struct iovec iov[PG_IOV_MAX]; + int blocks; + /* * Zero-fill the file. With this setting, we do this the hard way to * ensure that all the file space has really been allocated. On @@ -3326,15 +3329,28 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock) * indirect blocks are down on disk. Therefore, fdatasync(2) or * O_DSYNC will be sufficient to sync future writes to the log file. */ - for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ) + + /* Prepare to write out a lot of copies of our zero buffer at once. */ + for (int i = 0; i < lengthof(iov); ++i) { - errno = 0; - if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ) + iov[i].iov_base = zbuffer.data; + iov[i].iov_len = XLOG_BLCKSZ; + } + + /* Loop, writing as many blocks as we can for each system call. */ + blocks = wal_segment_size / XLOG_BLCKSZ; + for (int i = 0; i < blocks;) + { + int iovcnt = Min(blocks - i, lengthof(iov)); + off_t offset = i * XLOG_BLCKSZ; + + if (pg_pwritev_retry(fd, iov, iovcnt, offset) < 0) { - /* if write didn't set errno, assume no disk space */ - save_errno = errno ? errno : ENOSPC; + save_errno = errno; break; } + + i += iovcnt; } } else -- 2.20.1