On Mon, Feb 6, 2023 at 3:29 AM Andres Freund <and...@anarazel.de> wrote:
> On February 5, 2023 1:00:50 AM GMT+01:00, Thomas Munro 
> <thomas.mu...@gmail.com> wrote:
> >Are there any more descriptors we need to think about?
>
> Postmaster's listen sockets? Saves a bunch of syscalls, at least.

Assuming you mean accepted sockets, yeah, I see how two save two
syscalls there, and since you nerd-sniped me into looking into the
SOCK_CLOEXEC landscape, I like it even more now that I've understood
that accept4() is rubber-stamped for the next revision of POSIX[1] and
is already accepted almost everywhere.  It's not just window dressing,
you need it to write multi-threaded programs that fork/exec without
worrying about the window between fd creation and fcntl(FD_CLOEXEC) in
another thread; hopefully one day we will care about that sort of
thing in some places too!  Here's a separate patch for that.

I *guess* we need HAVE_DECL_ACCEPT4 for the guarded availability
system (cf pwritev) when Apple gets the memo, but see below.  Hard to
say if AIX is still receiving memos (cf recent speculation in the
Register).  All other target OSes seem to have had this stuff for a
while.

Since client connections already do fcntl(FD_CLOEXEC), postgres_fdw
connections didn't have this problem.  It seems reasonable to want to
skip a couple of system calls there too; also, client programs might
also be interested in future-POSIX's atomic race-free close-on-exec
socket fabrication.  So here also is a patch to use SOCK_CLOEXEC on
that end too, if available.

But ... hmph, all we can do here is test for the existence of
SOCK_NONBLOCK and SOCK_CLOEXEC, since there is no new function to test
for.  Maybe we should just assume accept4() also exists if these exist
(it's hard to imagine that Apple or IBM would address atomicity on one
end but not the other of a socket), but predictions are so difficult,
especially about the future!  Anyone want to guess if it's better to
leave the meson/configure probe in for the accept4 end or just roll
with the macros?

> Logging collector pipe write end, in backends?

The write end of the logging pipe is already closed, after dup2'ing it
to STDOUT_FILENO to STDERR_FILENO, so archive commands and suchlike do
receive the handle, but they want them.  It's the intended and
documented behaviour that anything written to that will finish up in
the log.

As for pipe2(O_CLOEXEC), I see the point of it in a multi-threaded
application.  It's not terribly useful for us though, because we
usually want to close only one end, except in the case of the
self-pipe.  But the self-pipe is no longer used on the systems that
have pipe2()-from-the-future.

I haven't tested this under EXEC_BACKEND yet.

[1] https://www.austingroupbugs.net/view.php?id=411
From c9a61ba8eaddb197c31095a163f7efdf2e3ea797 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sun, 5 Feb 2023 12:18:14 +1300
Subject: [PATCH v2 1/3] Don't leak descriptors into subprograms.

Open data and WAL files with O_CLOEXEC (POSIX 2018, SUSv4).  All of our
target systems have it, except Windows.  Windows already has that
behavior, because we wrote our own open() implementation.

Do the same for sockets with FD_CLOEXEC.  (A separate commit will
improve this with SOCK_CLOEXEC on some systems, but we'll need the
FD_CLOEXEC path as a fallback.)

This means that programs executed by COPY and archiving commands etc,
will not be able to mess with those descriptors (of course nothing stops
them from opening files, so this is more about tidyness and preventing
accidental problems than security).

Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c | 9 ++++++---
 src/backend/libpq/pqcomm.c        | 9 +++++++++
 src/backend/storage/file/fd.c     | 2 --
 src/backend/storage/smgr/md.c     | 9 +++++----
 src/include/port/win32_port.h     | 7 +++++++
 5 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fb4c860bde..3b44a0f237 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2937,7 +2937,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 * Try to use existent file (checkpoint maker may have created it already)
 	 */
 	*added = false;
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+							 get_sync_bit(sync_method));
 	if (fd < 0)
 	{
 		if (errno != ENOENT)
@@ -3098,7 +3099,8 @@ XLogFileInit(XLogSegNo logsegno, TimeLineID logtli)
 		return fd;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+							 get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3329,7 +3331,8 @@ XLogFileOpen(XLogSegNo segno, TimeLineID tli)
 
 	XLogFilePath(path, tli, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC |
+							 get_sync_bit(sync_method));
 	if (fd < 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 864c9debe8..d4622533fd 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -200,6 +200,15 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
+#ifndef WIN32
+	/*
+	 * Also make sure that any subprocesses that execute a new program don't
+	 * inherit the socket.
+	 */
+	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
+		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
+
 	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents);
 	socket_pos = AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE,
 								   MyProcPort->sock, NULL, NULL);
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 926d000f2e..933f17b398 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1033,10 +1033,8 @@ tryAgain:
 					   O_TRUNC |
 					   O_WRONLY)) == 0,
 					 "PG_O_DIRECT value collides with standard flag");
-#if defined(O_CLOEXEC)
 	StaticAssertStmt((PG_O_DIRECT & O_CLOEXEC) == 0,
 					 "PG_O_DIRECT value collides with O_CLOEXEC");
-#endif
 #if defined(O_DSYNC)
 	StaticAssertStmt((PG_O_DIRECT & O_DSYNC) == 0,
 					 "PG_O_DIRECT value collides with O_DSYNC");
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff..c464d6338c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -205,14 +205,15 @@ mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	fd = PathNameOpenFile(path,
+						  O_RDWR | O_CREAT | O_EXCL | PG_BINARY | O_CLOEXEC);
 
 	if (fd < 0)
 	{
 		int			save_errno = errno;
 
 		if (isRedo)
-			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+			fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC);
 		if (fd < 0)
 		{
 			/* be sure to report the error reported by create, not open */
@@ -523,7 +524,7 @@ mdopenfork(SMgrRelation reln, ForkNumber forknum, int behavior)
 
 	path = relpath(reln->smgr_rlocator, forknum);
 
-	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY | O_CLOEXEC);
 
 	if (fd < 0)
 	{
@@ -1188,7 +1189,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
 	fullpath = _mdfd_segpath(reln, forknum, segno);
 
 	/* open the file */
-	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
+	fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | O_CLOEXEC | oflags);
 
 	pfree(fullpath);
 
diff --git a/src/include/port/win32_port.h b/src/include/port/win32_port.h
index 9488195799..a8685f1520 100644
--- a/src/include/port/win32_port.h
+++ b/src/include/port/win32_port.h
@@ -353,6 +353,13 @@ extern int	_pglstat64(const char *name, struct stat *buf);
  */
 #define O_DSYNC 0x0080
 
+/*
+ * Our open wrapper does not create inheritable handles, so it is safe to
+ * ignore O_CLOEXEC.  (If we were using Windows' own open(), it might be
+ * necessary to convert this to _O_NOINHERIT.)
+ */
+#define O_CLOEXEC 0
+
 /*
  * Supplement to <errno.h>.
  *
-- 
2.39.1

From fd2bb0e83193fc337a8fc7e9e369387630252bd0 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 6 Feb 2023 13:51:19 +1300
Subject: [PATCH v2 2/3] Use accept4() to accept connections, where available.

The postmaster previously accepted connections and then made extra
system calls in child processes to make them non-blocking and (as of a
recent commit) close-on-exit.

On nearly all systems, that can be done atomically with flags to the
newer accept4() variant (recently accepted by POSIX for a future
revision, but around in the wild for a while now), saving a couple of
system calls.

The exceptions are Windows, where we didn't have to worry about that
problem anyway, EXEC_BACKEND builds on Unix (used by developers only for
slightly-more-like-Windows testing), and the straggler Unixes that
haven't implemented it yet (at the time of writing: macOS and AIX).

Suggested-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 configure                  | 12 ++++++++++++
 configure.ac               |  1 +
 meson.build                |  1 +
 src/backend/libpq/pqcomm.c | 33 +++++++++++++++++++++++++++------
 src/include/pg_config.h.in |  4 ++++
 5 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/configure b/configure
index 5d07fd0bb9..514da7c30f 100755
--- a/configure
+++ b/configure
@@ -16207,6 +16207,18 @@ _ACEOF
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+ac_fn_c_check_decl "$LINENO" "accept4" "ac_cv_have_decl_accept4" "#include <sys/socket.h>
+"
+if test "x$ac_cv_have_decl_accept4" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_ACCEPT4 $ac_have_decl
+_ACEOF
+
 ac_fn_c_check_decl "$LINENO" "preadv" "ac_cv_have_decl_preadv" "#include <sys/uio.h>
 "
 if test "x$ac_cv_have_decl_preadv" = xyes; then :
diff --git a/configure.ac b/configure.ac
index e9b74ced6c..6c10592fea 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1841,6 +1841,7 @@ AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 
 # We can't use AC_REPLACE_FUNCS to replace these functions, because it
 # won't handle deployment target restrictions on macOS
+AC_CHECK_DECLS([accept4], [], [], [#include <sys/socket.h>])
 AC_CHECK_DECLS([preadv], [], [AC_LIBOBJ(preadv)], [#include <sys/uio.h>])
 AC_CHECK_DECLS([pwritev], [], [AC_LIBOBJ(pwritev)], [#include <sys/uio.h>])
 
diff --git a/meson.build b/meson.build
index e379a252a5..c20a339340 100644
--- a/meson.build
+++ b/meson.build
@@ -2094,6 +2094,7 @@ decl_checks = [
 # checking for library symbols wouldn't handle deployment target
 # restrictions on macOS
 decl_checks += [
+  ['accept4', 'sys/socket.h'],
   ['preadv', 'sys/uio.h'],
   ['pwritev', 'sys/uio.h'],
 ]
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index d4622533fd..70734f1a4b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -184,6 +184,14 @@ pq_init(void)
 	/* set up process-exit hook to close the socket */
 	on_proc_exit(socket_close, 0);
 
+#ifndef WIN32
+	/*
+	 * On most systems, the socket has already been made non-blocking and
+	 * close-on-exec by StreamConnection().  On systems that don't have
+	 * accept4() yet, and EXEC_BACKEND builds that need the socket to survive
+	 * exec*(), we do that separately here in the child process.
+	 */
+#if !HAVE_DECL_ACCEPT4 || defined(EXEC_BACKEND)
 	/*
 	 * In backends (as soon as forked) we operate the underlying socket in
 	 * nonblocking mode and use latches to implement blocking semantics if
@@ -194,19 +202,17 @@ pq_init(void)
 	 * the client, which might require changing the mode again, leading to
 	 * infinite recursion.
 	 */
-#ifndef WIN32
 	if (!pg_set_noblock(MyProcPort->sock))
 		ereport(COMMERROR,
 				(errmsg("could not set socket to nonblocking mode: %m")));
-#endif
 
-#ifndef WIN32
 	/*
 	 * Also make sure that any subprocesses that execute a new program don't
 	 * inherit the socket.
 	 */
 	if (fcntl(MyProcPort->sock, F_SETFD, FD_CLOEXEC) < 0)
 		elog(FATAL, "fcntl(F_SETFD) failed on socket: %m");
+#endif
 #endif
 
 	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, FeBeWaitSetNEvents);
@@ -701,9 +707,24 @@ StreamConnection(pgsocket server_fd, Port *port)
 {
 	/* accept connection and fill in the client (remote) address */
 	port->raddr.salen = sizeof(port->raddr.addr);
-	if ((port->sock = accept(server_fd,
-							 (struct sockaddr *) &port->raddr.addr,
-							 &port->raddr.salen)) == PGINVALID_SOCKET)
+
+#if HAVE_DECL_ACCEPT4 && !defined(EXEC_BACKEND)
+	/*
+	 * If we have accept4(), we can avoid a couple of fcntl() calls in
+	 * pq_init().  We can't use this optimization in EXEC_BACKEND builds,
+	 * though, because they need the socket to survive exec().
+	 */
+	port->sock = accept4(server_fd,
+						 (struct sockaddr *) &port->raddr.addr,
+						 &port->raddr.salen,
+						 SOCK_CLOEXEC | SOCK_NONBLOCK);
+#else
+	port->sock = accept(server_fd,
+						(struct sockaddr *) &port->raddr.addr,
+						&port->raddr.salen);
+#endif
+
+	if (port->sock == PGINVALID_SOCKET)
 	{
 		ereport(LOG,
 				(errcode_for_socket_access(),
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c5a80b829e..b4777d7c77 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -91,6 +91,10 @@
 /* Define to 1 if you have the `CRYPTO_lock' function. */
 #undef HAVE_CRYPTO_LOCK
 
+/* Define to 1 if you have the declaration of `accept4', and to 0 if you
+   don't. */
+#undef HAVE_DECL_ACCEPT4
+
 /* Define to 1 if you have the declaration of `fdatasync', and to 0 if you
    don't. */
 #undef HAVE_DECL_FDATASYNC
-- 
2.39.1

From 8cd306114867b24640e70491eb35f61bf6771457 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 6 Feb 2023 14:54:03 +1300
Subject: [PATCH v2 3/3] Use newer client socket options where available.

As a recent commit did for the server side, we can now also try to set
client sockets to nonblocking and close-on-exit atomically when creating
the socket.  This uses flags expected in the next POSIX revision, but
already widely available on common platforms.

This saves a couple of fcntl() calls, and plugs a tiny window for
resource leaks in multi-threaded programs that might fork-and-exec while
another thread would otherwise have to run socket() and then fcntl().

Discussion: https://postgr.es/m/CA%2BhUKGKb6FsAdQWcRL35KJsftv%2B9zXqQbzwkfRf1i0J2e57%2BhQ%40mail.gmail.com
---
 src/interfaces/libpq/fe-connect.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 50b5df3490..e25ba94863 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2464,6 +2464,7 @@ keep_going:						/* We will come back to here until there is
 				{
 					struct addrinfo *addr_cur = conn->addr_cur;
 					char		host_addr[NI_MAXHOST];
+					int			sock_type;
 
 					/*
 					 * Advance to next possible host, if we've tried all of
@@ -2494,7 +2495,23 @@ keep_going:						/* We will come back to here until there is
 						conn->connip = strdup(host_addr);
 
 					/* Try to create the socket */
-					conn->sock = socket(addr_cur->ai_family, SOCK_STREAM, 0);
+					sock_type = SOCK_STREAM;
+#ifdef SOCK_CLOEXEC
+					/*
+					 * Atomically mark close-on-exec, if possible on this
+					 * platform, for the benefit of multi-threaded programs.
+					 * Otherwise see below.
+					 */
+					sock_type |= SOCK_CLOEXEC;
+#endif
+#ifdef SOCK_NONBLOCK
+					/*
+					 * We might as well skip a system call for nonblocking mode
+					 * too, if we can.
+					 */
+					sock_type |= SOCK_NONBLOCK;
+#endif
+					conn->sock = socket(addr_cur->ai_family, sock_type, 0);
 					if (conn->sock == PGINVALID_SOCKET)
 					{
 						int			errorno = SOCK_ERRNO;
@@ -2540,6 +2557,7 @@ keep_going:						/* We will come back to here until there is
 							goto keep_going;
 						}
 					}
+#ifndef SOCK_NONBLOCK
 					if (!pg_set_noblock(conn->sock))
 					{
 						libpq_append_conn_error(conn, "could not set socket to nonblocking mode: %s",
@@ -2547,7 +2565,9 @@ keep_going:						/* We will come back to here until there is
 						conn->try_next_addr = true;
 						goto keep_going;
 					}
+#endif
 
+#ifndef SOCK_CLOEXEC
 #ifdef F_SETFD
 					if (fcntl(conn->sock, F_SETFD, FD_CLOEXEC) == -1)
 					{
@@ -2557,6 +2577,7 @@ keep_going:						/* We will come back to here until there is
 						goto keep_going;
 					}
 #endif							/* F_SETFD */
+#endif
 
 					if (addr_cur->ai_family != AF_UNIX)
 					{
-- 
2.39.1

Reply via email to