On Mon, Feb 6, 2023 at 3:29 AM Andres Freund <and...@anarazel.de> wrote: > On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro > <thomas.mu...@gmail.com> wrote: > >Are there any more descriptors we need to think about? > > Postmaster's listen sockets? Saves a bunch of syscalls, at least.
Assuming you mean accepted sockets, yeah, I see how two save two syscalls there, and since you nerd-sniped me into looking into the SOCK_CLOEXEC landscape, I like it even more now that I've understood that accept4() is rubber-stamped for the next revision of POSIX[1] and is already accepted almost everywhere. It's not just window dressing, you need it to write multi-threaded programs that fork/exec without worrying about the window between fd creation and fcntl(FD_CLOEXEC) in another thread; hopefully one day we will care about that sort of thing in some places too! Here's a separate patch for that. I *guess* we need HAVE_DECL_ACCEPT4 for the guarded availability system (cf pwritev) when Apple gets the memo, but see below. Hard to say if AIX is still receiving memos (cf recent speculation in the Register). All other target OSes seem to have had this stuff for a while. Since client connections already do fcntl(FD_CLOEXEC), postgres_fdw connections didn't have this problem. It seems reasonable to want to skip a couple of system calls there too; also, client programs might also be interested in future-POSIX's atomic race-free close-on-exec socket fabrication. So here also is a patch to use SOCK_CLOEXEC on that end too, if available. But ... hmph, all we can do here is test for the existence of SOCK_NONBLOCK and SOCK_CLOEXEC, since there is no new function to test for. Maybe we should just assume accept4() also exists if these exist (it's hard to imagine that Apple or IBM would address atomicity on one end but not the other of a socket), but predictions are so difficult, especially about the future! Anyone want to guess if it's better to leave the meson/configure probe in for the accept4 end or just roll with the macros? > Logging collector pipe write end, in backends? The write end of the logging pipe is already closed, after dup2'ing it to STDOUT_FILENO to STDERR_FILENO, so archive commands and suchlike do receive the handle, but they want them. It's the intended and documented behaviour that anything written to that will finish up in the log. As for pipe2(O_CLOEXEC), I see the point of it in a multi-threaded application. It's not terribly useful for us though, because we usually want to close only one end, except in the case of the self-pipe. But the self-pipe is no longer used on the systems that have pipe2()-from-the-future. I haven't tested this under EXEC_BACKEND yet. [1] https://www.austingroupbugs.net/view.php?id=411
From c9a61ba8eaddb197c31095a163f7efdf2e3ea797 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.munro@gmail.com> Date: Sun, 5 Feb 2023 12:18:14 +1300 Subject: [PATCH v2 1/3] Don't leak descriptors into subprograms. Open data and WAL files with O_CLOEXEC (POSIX 2018, SUSv4). All of our target systems have it, except Windows. Windows already has that behavior, because we wrote our own open() implementation. Do the same for sockets with FD_CLOEXEC. (A separate commit will improve this with SOCK_CLOEXEC on some systems, but we'll need the FD_CLOEXEC path as a fallback.) This means that programs executed by COPY and archiving commands etc, will not be able to mess with those descriptors (of course nothing stops them from opening files, so this is more about tidyness and preventing accidental problems than security). Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com --- src/backend/access/transam/xlog.c | 9 ++++++--- src/backend/libpq/pqcomm.c | 9 +++++++++ src/backend/storage/file/fd.c | 2 -- src/backend/storage/smgr/md.c | 9 +++++---- src/include/port/win32_port.h | 7 +++++++ 5 files changed, 27 insertions(+), 9 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index fb4c860bde..3b44a0f237 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2937,7 +2937,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli, * Try to use existent file (checkpoint maker may have created it already) */ *added = false; - fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method)); + fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC | + get_sync_bit(sync_method)); if (fd < 0) { if (errno != ENOENT) @@ -3098,7 +3099,8 @@ XLogFileInit(XLogSegNo logsegno, TimeLineID logtli) return fd; /* Now open original target segment (might not be file I just made) */ - fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method)); + fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC | + get_sync_bit(sync_method)); if (fd < 0) ereport(ERROR, (errcode_for_file_access(), @@ -3329,7 +3331,8 @@ XLogFileOpen(XLogSegNo segno, TimeLineID tli) XLogFilePath(path, tli, segno, wal_segment_size); - fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method)); + fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC | + get_sync_bit(sync_method)); if (fd < 0) ereport(PANIC, (errcode_for_file_access(), diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 864c9debe8..d4622533fd 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -200,6 +200,15 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif +#ifndef WIN32 + /* + * Also make sure that any subprocesses that execute a new program don't + * inherit the socket. + */ + if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0) + elog(FATAL, "fcntl(F_SETFD) failed on socket: %m"); +#endif + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents); socket_pos = AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index 926d000f2e..933f17b398 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -1033,10 +1033,8 @@ tryAgain: O_TRUNC | O_WRONLY)) == 0, "PG_O_DIRECT value collides with standard flag"); -#if defined(O_CLOEXEC) StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0, "PG_O_DIRECT value collides with O_CLOEXEC"); -#endif #if defined(O_DSYNC) StaticAssertStmt((PG_O_DIRECT & O_DSYNC) == 0, "PG_O_DIRECT value collides with O_DSYNC"); diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c index 60c9905eff..c464d6338c 100644 --- a/src/backend/storage/smgr/md.c +++ b/src/backend/storage/smgr/md.c @@ -205,14 +205,15 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) path = relpath(reln->smgr_rlocator, forknum); - fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY); + fd = PathNameOpenFile(path, + O_RDWR | O_CREAT | O_EXCL | PG_BINARY | O_CLOEXEC); if (fd < 0) { int save_errno = errno; if (isRedo) - fd = PathNameOpenFile(path, O_RDWR | PG_BINARY); + fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC); if (fd < 0) { /* be sure to report the error reported by create, not open */ @@ -523,7 +524,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior) path = relpath(reln->smgr_rlocator, forknum); - fd = PathNameOpenFile(path, O_RDWR | PG_BINARY); + fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC); if (fd < 0) { @@ -1188,7 +1189,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno, fullpath = _mdfd_segpath(reln, forknum, segno); /* open the file */ - fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags); + fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | O_CLOEXEC | oflags); pfree(fullpath); diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h index 9488195799..a8685f1520 100644 --- a/src/include/port/win32_port.h +++ b/src/include/port/win32_port.h @@ -353,6 +353,13 @@ extern int _pglstat64(const char *name, struct stat *buf); */ #define O_DSYNC 0x0080 +/* + * Our open wrapper does not create inheritable handles, so it is safe to + * ignore O_CLOEXEC. (If we were using Windows' own open(), it might be + * necessary to convert this to _O_NOINHERIT.) + */ +#define O_CLOEXEC 0 + /* * Supplement to <errno.h>. * -- 2.39.1
From fd2bb0e83193fc337a8fc7e9e369387630252bd0 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.munro@gmail.com> Date: Mon, 6 Feb 2023 13:51:19 +1300 Subject: [PATCH v2 2/3] Use accept4() to accept connections, where available. The postmaster previously accepted connections and then made extra system calls in child processes to make them non-blocking and (as of a recent commit) close-on-exit. On nearly all systems, that can be done atomically with flags to the newer accept4() variant (recently accepted by POSIX for a future revision, but around in the wild for a while now), saving a couple of system calls. The exceptions are Windows, where we didn't have to worry about that problem anyway, EXEC_BACKEND builds on Unix (used by developers only for slightly-more-like-Windows testing), and the straggler Unixes that haven't implemented it yet (at the time of writing: macOS and AIX). Suggested-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com --- configure | 12 ++++++++++++ configure.ac | 1 + meson.build | 1 + src/backend/libpq/pqcomm.c | 33 +++++++++++++++++++++++++++------ src/include/pg_config.h.in | 4 ++++ 5 files changed, 45 insertions(+), 6 deletions(-) diff --git a/configure b/configure index 5d07fd0bb9..514da7c30f 100755 --- a/configure +++ b/configure @@ -16207,6 +16207,18 @@ _ACEOF # We can't use AC_REPLACE_FUNCS to replace these functions, because it # won't handle deployment target restrictions on macOS +ac_fn_c_check_decl "$LINENO" "accept4" "ac_cv_have_decl_accept4" "#include <sys/socket.h> +" +if test "x$ac_cv_have_decl_accept4" = xyes; then : + ac_have_decl=1 +else + ac_have_decl=0 +fi + +cat >>confdefs.h <<_ACEOF +#define HAVE_DECL_ACCEPT4 $ac_have_decl +_ACEOF + ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h> " if test "x$ac_cv_have_decl_preadv" = xyes; then : diff --git a/configure.ac b/configure.ac index e9b74ced6c..6c10592fea 100644 --- a/configure.ac +++ b/configure.ac @@ -1841,6 +1841,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen]) # We can't use AC_REPLACE_FUNCS to replace these functions, because it # won't handle deployment target restrictions on macOS +AC_CHECK_DECLS([accept4], [], [], [#include <sys/socket.h>]) AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include <sys/uio.h>]) AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include <sys/uio.h>]) diff --git a/meson.build b/meson.build index e379a252a5..c20a339340 100644 --- a/meson.build +++ b/meson.build @@ -2094,6 +2094,7 @@ decl_checks = [ # checking for library symbols wouldn't handle deployment target # restrictions on macOS decl_checks += [ + ['accept4', 'sys/socket.h'], ['preadv', 'sys/uio.h'], ['pwritev', 'sys/uio.h'], ] diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index d4622533fd..70734f1a4b 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -184,6 +184,14 @@ pq_init(void) /* set up process-exit hook to close the socket */ on_proc_exit(socket_close, 0); +#ifndef WIN32 + /* + * On most systems, the socket has already been made non-blocking and + * close-on-exec by StreamConnection(). On systems that don't have + * accept4() yet, and EXEC_BACKEND builds that need the socket to survive + * exec*(), we do that separately here in the child process. + */ +#if !HAVE_DECL_ACCEPT4 || defined(EXEC_BACKEND) /* * In backends (as soon as forked) we operate the underlying socket in * nonblocking mode and use latches to implement blocking semantics if @@ -194,19 +202,17 @@ pq_init(void) * the client, which might require changing the mode again, leading to * infinite recursion. */ -#ifndef WIN32 if (!pg_set_noblock(MyProcPort->sock)) ereport(COMMERROR, (errmsg("could not set socket to nonblocking mode: %m"))); -#endif -#ifndef WIN32 /* * Also make sure that any subprocesses that execute a new program don't * inherit the socket. */ if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0) elog(FATAL, "fcntl(F_SETFD) failed on socket: %m"); +#endif #endif FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents); @@ -701,9 +707,24 @@ StreamConnection(pgsocket server_fd, Port *port) { /* accept connection and fill in the client (remote) address */ port->raddr.salen = sizeof(port->raddr.addr); - if ((port->sock = accept(server_fd, - (struct sockaddr *) &port->raddr.addr, - &port->raddr.salen)) == PGINVALID_SOCKET) + +#if HAVE_DECL_ACCEPT4 && !defined(EXEC_BACKEND) + /* + * If we have accept4(), we can avoid a couple of fcntl() calls in + * pq_init(). We can't use this optimization in EXEC_BACKEND builds, + * though, because they need the socket to survive exec(). + */ + port->sock = accept4(server_fd, + (struct sockaddr *) &port->raddr.addr, + &port->raddr.salen, + SOCK_CLOEXEC | SOCK_NONBLOCK); +#else + port->sock = accept(server_fd, + (struct sockaddr *) &port->raddr.addr, + &port->raddr.salen); +#endif + + if (port->sock == PGINVALID_SOCKET) { ereport(LOG, (errcode_for_socket_access(), diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index c5a80b829e..b4777d7c77 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -91,6 +91,10 @@ /* Define to 1 if you have the `CRYPTO_lock' function. */ #undef HAVE_CRYPTO_LOCK +/* Define to 1 if you have the declaration of `accept4', and to 0 if you + don't. */ +#undef HAVE_DECL_ACCEPT4 + /* Define to 1 if you have the declaration of `fdatasync', and to 0 if you don't. */ #undef HAVE_DECL_FDATASYNC -- 2.39.1
From 8cd306114867b24640e70491eb35f61bf6771457 Mon Sep 17 00:00:00 2001 From: Thomas Munro <thomas.munro@gmail.com> Date: Mon, 6 Feb 2023 14:54:03 +1300 Subject: [PATCH v2 3/3] Use newer client socket options where available. As a recent commit did for the server side, we can now also try to set client sockets to nonblocking and close-on-exit atomically when creating the socket. This uses flags expected in the next POSIX revision, but already widely available on common platforms. This saves a couple of fcntl() calls, and plugs a tiny window for resource leaks in multi-threaded programs that might fork-and-exec while another thread would otherwise have to run socket() and then fcntl(). Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com --- src/interfaces/libpq/fe-connect.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 50b5df3490..e25ba94863 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -2464,6 +2464,7 @@ keep_going: /* We will come back to here until there is { struct addrinfo *addr_cur = conn->addr_cur; char host_addr[NI_MAXHOST]; + int sock_type; /* * Advance to next possible host, if we've tried all of @@ -2494,7 +2495,23 @@ keep_going: /* We will come back to here until there is conn->connip = strdup(host_addr); /* Try to create the socket */ - conn->sock = socket(addr_cur->ai_family, SOCK_STREAM, 0); + sock_type = SOCK_STREAM; +#ifdef SOCK_CLOEXEC + /* + * Atomically mark close-on-exec, if possible on this + * platform, for the benefit of multi-threaded programs. + * Otherwise see below. + */ + sock_type |= SOCK_CLOEXEC; +#endif +#ifdef SOCK_NONBLOCK + /* + * We might as well skip a system call for nonblocking mode + * too, if we can. + */ + sock_type |= SOCK_NONBLOCK; +#endif + conn->sock = socket(addr_cur->ai_family, sock_type, 0); if (conn->sock == PGINVALID_SOCKET) { int errorno = SOCK_ERRNO; @@ -2540,6 +2557,7 @@ keep_going: /* We will come back to here until there is goto keep_going; } } +#ifndef SOCK_NONBLOCK if (!pg_set_noblock(conn->sock)) { libpq_append_conn_error(conn, "could not set socket to nonblocking mode: %s", @@ -2547,7 +2565,9 @@ keep_going: /* We will come back to here until there is conn->try_next_addr = true; goto keep_going; } +#endif +#ifndef SOCK_CLOEXEC #ifdef F_SETFD if (fcntl(conn->sock, F_SETFD, FD_CLOEXEC) == -1) { @@ -2557,6 +2577,7 @@ keep_going: /* We will come back to here until there is goto keep_going; } #endif /* F_SETFD */ +#endif if (addr_cur->ai_family != AF_UNIX) { -- 2.39.1