On Mon, 17 Feb 2025 at 18:24, Andres Freund <and...@anarazel.de> wrote:
> Why not do this only in the if (rlim.rlim_cur < nclients + 3) case?

Done. I also changed it to not bump to rlim_max, but only to nclients
+ 3. I'll update the rest of the patches later, but responses are below.

> I think it might be better to have a precursor commit that wraps system(),
> including the fflush(), pgstat_report_wait_start(), pgstat_report_wait_end().
>
> It seems a bit silly to have a dance of 5 function calls that are repeated in
> three places.

Fair enough, I'll update that in the next version.

> > +#ifdef HAVE_GETRLIMIT
> > +/*
> > + * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
> > + * Returns true if successful, false otherwise.
> > + */
> > +static bool
> > +IncreaseOpenFileLimit(int extra_files)
> > +{
>
> Why is this a relative instead of an absolute amount?

I don't feel strongly about this, but it seemed nicer to encapsulate
that. My callers (and I expect the io_uring one too) all had the
relative amount. It seemed a bit silly to always require the caller to
add custom_max_open_files.rlim_cur to it, especially since you might
want to call IncreaseOpenFileLimit from another file for your io_uring
stuff, which would mean also making custom_max_open_files non-static
(or querying getrlimit again).
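
For illustration, with the relative API a (hypothetical) io_uring call
site stays trivial:

	/* the caller only needs to know how many extra fds it wants */
	if (!IncreaseOpenFileLimit(fds_needed_for_io_uring))
		elog(LOG, "could not raise open file limit for io_uring");

With an absolute API every such caller would first have to fetch the
current soft limit via getrlimit(RLIMIT_NOFILE, ...).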

> > +     if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
> > +     {
> > +             ereport(WARNING, (errmsg("setrlimit failed: %m")));
> > +             return false;
> > +     }
>
> Is a WARNING really appropriate here?  I guess it's just copied from elsewhere
> in the file, but still...

This shouldn't ever fail: we checked beforehand that we are allowed
to increase the limit this far. So if setrlimit fails, there's a bug
somewhere. That made me think it was sensible to report a warning, so
at least people can see that something strange is going on.

> > +     if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
> > +     {
> > +             ereport(WARNING, (errmsg("setrlimit failed: %m")));
> > +     }
> > +#endif
> > +}
>
> Hm. Does this actually work if we currently have more than
> original_max_open_files.rlim_cur files open?

So according to at least the Linux and FreeBSD manpages, lowering the
limit below the number of currently open files is not a failure mode:
- https://linux.die.net/man/2/setrlimit
- https://man.freebsd.org/cgi/man.cgi?query=setrlimit

I also tried this out manually on my Linux machine, by starting
postgres with a soft limit of only 15 (any lower and it wouldn't
start). I was then able to change the limits without issue when
calling setrlimit or system.
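
To make that easy to verify, here's a minimal standalone sketch of the
same behaviour (a repro I'd expect to work, untested as written):

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/resource.h>

	int
	main(void)
	{
		struct rlimit rlim;
		int			i;

		/* open 20 fds on top of stdin/stdout/stderr */
		for (i = 0; i < 20; i++)
			if (dup(2) < 0)
				perror("dup");

		/* now lower the soft limit below the number of open fds */
		if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
			perror("getrlimit");
		rlim.rlim_cur = 15;
		if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
			perror("setrlimit");	/* no failure on Linux/FreeBSD */
		else
			printf("lowered soft limit to 15 with 23 fds open\n");
		return 0;
	}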

... Sadly that turned out not to be true when calling popen in
OpenPipeStream (of course), because popen actually opens a file in the
postgres process too, and at that point you're already over the
restored file limit. I can see two ways around that:
1. Write a custom version of popen that lowers the limit just before
calling exec in the forked process.
2. Don't bother restoring the original limit for OpenPipeStream.

My suggestion would be to go for 2: it doesn't seem worth the
complexity of a custom popen implementation just to handle buggy
programs (i.e. ones that cannot handle more than 1024 files, while
still trying to open more than that many).
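
For reference, a rough sketch of what option 1 could look like
(read-mode only, assuming fd.c's static original_max_open_files, no
pclose() bookkeeping):

	static FILE *
	popen_with_original_limit(const char *command)
	{
		int			pipefd[2];
		pid_t		pid;

		if (pipe(pipefd) < 0)
			return NULL;

		pid = fork();
		if (pid < 0)
		{
			close(pipefd[0]);
			close(pipefd[1]);
			return NULL;
		}
		if (pid == 0)
		{
			/* child: lowering the limit here cannot affect the parent */
			setrlimit(RLIMIT_NOFILE, &original_max_open_files);
			close(pipefd[0]);
			dup2(pipefd[1], STDOUT_FILENO);
			close(pipefd[1]);
			execl("/bin/sh", "sh", "-c", command, (char *) NULL);
			_exit(127);
		}
		close(pipefd[1]);
		return fdopen(pipefd[0], "r");
	}

Once you add "w" mode support and pclose() status handling, that's
exactly the kind of complexity I'd rather avoid.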

> It'd be fine to do the system() with more files open, because it'll internally
> do exec(), but of course setrlimit() doesn't know that ...

I'm not sure I understand what you mean here.


> All these ifdefs make this function rather hard to read.  It was kinda bad
> before, but it does get even worse with this patch.  Not really sure what to
> do about that though.

Agreed, I'll see if I can figure out a way to restructure it a bit.

> We were discussing calling set_max_fds() twice to deal with io_uring FDs
> requiring a higher FD limit than before.  With the patch as-is, we'd overwrite
> our original original_max_open_files in that case...

That should be very easy to change if that's indeed how we want to
bump the rlimit for the io_uring file descriptors, but it didn't seem
like you were sure yet which of the two options you liked best.
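
E.g. a guard along these lines in count_usable_fds() would avoid the
overwrite (a sketch against the patch below; the bool is new):

	static bool original_rlimit_saved = false;

	getrlimit_status = getrlimit(RLIMIT_NOFILE, &custom_max_open_files);
	if (getrlimit_status != 0)
		ereport(WARNING, (errmsg("getrlimit failed: %m")));
	else if (!original_rlimit_saved)
	{
		/* only the first call records the limit we inherited */
		original_max_open_files = custom_max_open_files;
		original_rlimit_saved = true;
	}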

> Hm. When is this second case here reachable and useful? Shouldn't we already
> have increased the limit before getting here? If so IncreaseOpenFileLimit()
> won't help, no?

While it's a very unlikely edge case, it's theoretically possible
(afaict) that the maximum number of open files is already open when we
enter the loop. highestfd is still 0 at that point (because it's the
first iteration), so we only find out that we're at the limit by
getting ENFILE from dup().

I'll add a code comment about this, because it's indeed not obvious.

> > @@ -1078,6 +1164,7 @@ set_max_safe_fds(void)
> >                max_safe_fds, usable_fds, already_open);
> >  }
> >
> > +
> >  /*
> >   * Open a file with BasicOpenFilePerm() and pass default file mode for the
> >   * fileMode parameter.
>
> Unrelated change.

Ugh, yeah, I guess I added the new functions there originally, but
moved them around and didn't clean up the added whitespace.

> If we want to reflect the value, shouldn't it show up as PGC_S_OVERRIDE or
> such?

Makes sense
From 761c3b0a161eb9d680354c9c20a872091ce224be Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-t...@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v3 3/3] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out whether max_safe_fds is
significantly lower than max_files_per_process. This starts reflecting
the value of max_safe_fds in max_files_per_process after our limit
detection. We still keep two separate variables because their values
intentionally differ in the bootstrap and standalone-backend cases.
---
 src/backend/storage/file/fd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 04fb93be56d..05c0792e4e1 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1149,6 +1149,9 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
+	/* Update GUC variable to allow users to see the result */
+	max_files_per_process = max_safe_fds;
+
 	/*
 	 * Make sure we still have enough to get by.
 	 */
-- 
2.43.0

From f3ed3a3484e1d3dcfea3d7f94cd548533a84995d Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-t...@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v3 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason it hasn't been raised is that doing so would break legacy
programs that use the select(2) system call in hard-to-debug ways. So
instead, programs that want to opt in to a higher open file limit are
expected to bump their soft limit to their hard limit on startup.
Details on this are very well explained in a blog post by the systemd
author[1]. A similar change was also done for the Go language[2].

This starts bumping the postmaster's soft open file limit when we
realize that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning
of the max_files_per_process GUC. The actual (not publicly exposed)
limit is max_safe_fds; previously this was set to:
max_files_per_process - already_open_files - NUM_RESERVED_FDS
After this change we instead try to set max_safe_fds to
max_files_per_process if the system allows it. This is more natural
for users to understand, because now the limit on the number of files
they can open is actually what they configured in
max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at
AIO, there are many reports from people who need to change their soft
file limit before starting Postgres, sometimes falling back to
lowering max_files_per_process when they fail to do so[3-8]. It's also
not all that strange to fail at raising the soft open file limit,
because there are multiple places where one can configure such limits
and usually only one of them is effective (which one depends on how
Postgres is started). In cloud environments it's also often not
possible for the user to change the soft limit, because they don't
control how Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shelling out to other executables. This is done as a
precaution, in case those executables use select(2).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 doc/src/sgml/config.sgml                 |  17 ++--
 src/backend/access/transam/xlogarchive.c |   4 +
 src/backend/archive/shell_archive.c      |   2 +
 src/backend/storage/file/fd.c            | 111 ++++++++++++++++++++---
 src/include/storage/fd.h                 |   2 +
 5 files changed, 116 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 336630ce417..6b4bf78a9fc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2352,15 +2352,14 @@ include_dir 'conf.d'
       </term>
       <listitem>
        <para>
-        Sets the maximum number of simultaneously open files allowed to each
-        server subprocess. The default is one thousand files. If the kernel is enforcing
-        a safe per-process limit, you don't need to worry about this setting.
-        But on some platforms (notably, most BSD systems), the kernel will
-        allow individual processes to open many more files than the system
-        can actually support if many processes all try to open
-        that many files. If you find yourself seeing <quote>Too many open
-        files</quote> failures, try reducing this setting.
-        This parameter can only be set at server start.
+        Sets the maximum number of files each server subprocess is allowed
+        to open. The default is one thousand files. If the kernel is enforcing
+        a lower soft limit, Postgres will automatically increase it if the
+        hard limit allows that. Setting this value too high can cause resource
+        exhaustion problems: on some platforms (notably, most BSD systems), the
+        kernel will allow individual processes to open many more files than the
+        system can actually support if many processes all try to open that many
+        files. This parameter can only be set at server start.
        </para>
       </listitem>
      </varlistentry>
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..bf17426039e 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,6 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
 
@@ -180,6 +181,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 
 	pgstat_report_wait_end();
 	pfree(xlogRestoreCmd);
+	RestoreCustomOpenFileLimit();
 
 	if (rc == 0)
 	{
@@ -325,10 +327,12 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	/*
 	 * execute the constructed command
 	 */
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
 	rc = system(xlogRecoveryCmd);
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 
 	pfree(xlogRecoveryCmd);
 
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..c16f354a118 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -75,10 +75,12 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 			(errmsg_internal("executing archive command \"%s\"",
 							 xlogarchcmd)));
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
 	rc = system(xlogarchcmd);
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 
 	if (rc != 0)
 	{
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e454db4c020..04fb93be56d 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -157,6 +157,12 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -945,6 +951,82 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+#ifdef HAVE_GETRLIMIT
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+	struct rlimit rlim = custom_max_open_files;
+
+	/* If we're already at the hard limit, there is nothing more we can do */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+}
+#endif
+
+/*
+ * RestoreOriginalOpenFileLimit --- Restore the original open file limit that
+ * 		was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2)
+ * which can only handle file descriptors up to 1024.
+ */
+void
+RestoreOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
+/*
+ * RestoreCustomOpenFileLimit --- Restores our custom open file limit after a
+ * 		previous call to RestoreOriginalOpenFileLimit.
+ */
+void
+RestoreCustomOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -969,7 +1051,6 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			j;
 
 #ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
 	int			getrlimit_status;
 #endif
 
@@ -977,9 +1058,11 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	fd = (int *) palloc(size * sizeof(int));
 
 #ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
+	getrlimit_status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
 	if (getrlimit_status != 0)
 		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+	else
+		custom_max_open_files = original_max_open_files;
 #endif							/* HAVE_GETRLIMIT */
 
 	/* dup until failure or probe limit reached */
@@ -993,13 +1076,21 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
+		if (getrlimit_status == 0 && highestfd >= custom_max_open_files.rlim_cur - 1)
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 #endif
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+#ifdef HAVE_GETRLIMIT
+			if (errno == ENFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+#endif
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -1053,15 +1144,10 @@ set_max_safe_fds(void)
 	 * or the experimentally-determined EMFILE limit.
 	 *----------
 	 */
-	count_usable_fds(max_files_per_process,
+	count_usable_fds(max_files_per_process + NUM_RESERVED_FDS,
 					 &usable_fds, &already_open);
 
-	max_safe_fds = Min(usable_fds, max_files_per_process - already_open);
-
-	/*
-	 * Take off the FDs reserved for system() etc.
-	 */
-	max_safe_fds -= NUM_RESERVED_FDS;
+	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
 	/*
 	 * Make sure we still have enough to get by.
@@ -1078,6 +1164,7 @@ set_max_safe_fds(void)
 		 max_safe_fds, usable_fds, already_open);
 }
 
+
 /*
  * Open a file with BasicOpenFilePerm() and pass default file mode for the
  * fileMode parameter.
@@ -2724,6 +2811,7 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pqsignal(SIGPIPE, SIG_DFL);
 	errno = 0;
@@ -2731,6 +2819,7 @@ TryAgain:
 	save_errno = errno;
 	pqsignal(SIGPIPE, SIG_IGN);
 	errno = save_errno;
+	RestoreCustomOpenFileLimit();
 	if (file != NULL)
 	{
 		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e3067ab6597..d24e7f1c8df 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -177,6 +177,8 @@ extern void RemovePgTempFiles(void);
 extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
 								   bool unlink_all);
 extern bool looks_like_temp_rel_name(const char *name);
+extern void RestoreOriginalOpenFileLimit(void);
+extern void RestoreCustomOpenFileLimit(void);
 
 extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
-- 
2.43.0

From fe11e75d913ac320f11ac2250c94fa58b118b2e1 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-t...@jeltef.nl>
Date: Tue, 11 Feb 2025 22:04:44 +0100
Subject: [PATCH v3 1/3] Bump pgbench soft open file limit (RLIMIT_NOFILE) to
 hard limit

The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason it hasn't been raised
is that doing so would break legacy programs that use the select(2)
system call in hard-to-debug ways. So instead, programs that want to
opt in to a higher open file limit are expected to bump their soft
limit to their hard limit on startup. Details on this are very well
explained in a blog post by the systemd author[1].

This starts bumping pgbench's soft open file limit to the hard open
file limit. This makes sure users are no longer told to change their
ulimit and retry; instead we now do that for them automatically.

[1]: https://0pointer.net/blog/file-descriptor-limits.html
---
 src/bin/pgbench/pgbench.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 5e1fcf59c61..8f53bf50ab1 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6815,13 +6815,26 @@ main(int argc, char **argv)
 #ifdef HAVE_GETRLIMIT
 				if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
 					pg_fatal("getrlimit failed: %m");
-				if (rlim.rlim_cur < nclients + 3)
+
+				if (rlim.rlim_max < nclients + 3)
 				{
 					pg_log_error("need at least %d open files, but system limit is %ld",
-								 nclients + 3, (long) rlim.rlim_cur);
+								 nclients + 3, (long) rlim.rlim_max);
 					pg_log_error_hint("Reduce number of clients, or use limit/ulimit to increase the system limit.");
 					exit(1);
 				}
+
+				if (rlim.rlim_cur < nclients + 3)
+				{
+					rlim.rlim_cur = nclients + 3;
+					if (setrlimit(RLIMIT_NOFILE, &rlim) == -1)
+					{
+						pg_log_error("need at least %d open files, but couldn't raise the limit",
+									 nclients + 3);
+						pg_log_error_hint("Reduce number of clients, or use limit/ulimit to increase the system limit.");
+						exit(1);
+					}
+				}
 #endif							/* HAVE_GETRLIMIT */
 				break;
 			case 'C':

base-commit: c407d5426b877b41be87f9e679e321bb2c42e47d
-- 
2.43.0
