On Tue, 11 Feb 2025 at 22:18, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> Andres Freund <and...@anarazel.de> writes:
> > My suggestion would be to redefine max_files_per_process as the number of
> > files we try to be able to open in backends. I.e. set_max_safe_fds() would
> > first count the number of already open fds (since those will largely be
> > inherited by child processes) and then check if we can open up to
> > max_files_per_process files in addition. Adjusting the RLIMIT_NOFILE if
> > necessary.
>
> Seems plausible. IIRC we also want 10 or so FDs available as "slop"
> for code that doesn't go through fd.c.
Attached is a patchset that does this. I split off the pgbench change,
which is in the first patch. The change to the postmaster is in the
second. The third patch is a small follow-up to make it easier to notice
that max_safe_fds is lower than intended.
From 8e964db585989734a5f6c1449ffb4c62e1190a6a Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-t...@jeltef.nl>
Date: Tue, 11 Feb 2025 22:04:44 +0100
Subject: [PATCH v2 1/3] Bump pgbench soft open file limit (RLIMIT_NOFILE) to
 hard limit

The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason this hasn't changed is
that raising it would break legacy programs that use the select(2)
system call in hard-to-debug ways. So instead, programs that want to
opt in to a higher open file limit are expected to bump their soft
limit to their hard limit on startup. Details on this are explained
well in a blog post by the systemd author[1].

This commit bumps pgbench's soft open file limit to the hard open file
limit. This makes sure users are not told to change their ulimit and
then retry; we now do that for them automatically.

[1]: https://0pointer.net/blog/file-descriptor-limits.html
---
 src/bin/pgbench/pgbench.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 5e1fcf59c61..ea7bf980bea 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -6815,6 +6815,15 @@ main(int argc, char **argv)
 #ifdef HAVE_GETRLIMIT
 	if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
 		pg_fatal("getrlimit failed: %m");
+
+	/*
+	 * Bump the soft limit to the hard limit so we don't run into
+	 * unnecessarily low file limits.
+	 */
+	rlim.rlim_cur = rlim.rlim_max;
+	if (setrlimit(RLIMIT_NOFILE, &rlim) == -1)
+		pg_fatal("setrlimit failed: %m");
+
 	if (rlim.rlim_cur < nclients + 3)
 	{
 		pg_log_error("need at least %d open files, but system limit is %ld",

base-commit: fd602f29c19d4f483f54d93abe240c12219d9f51
-- 
2.43.0
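As a side note for anyone who wants to experiment with this outside of
pgbench: below is a minimal standalone sketch (not part of the patch,
and it glosses over the RLIM_INFINITY case) of the same soft-to-hard
bump. Printing the limits as long matches what the pgbench code does:

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int
main(void)
{
	struct rlimit rlim;

	/* Fetch the current soft (rlim_cur) and hard (rlim_max) limits. */
	if (getrlimit(RLIMIT_NOFILE, &rlim) == -1)
	{
		perror("getrlimit");
		return EXIT_FAILURE;
	}
	printf("before: soft=%ld hard=%ld\n",
		   (long) rlim.rlim_cur, (long) rlim.rlim_max);

	/* Raise the soft limit to the hard limit; no privileges required. */
	rlim.rlim_cur = rlim.rlim_max;
	if (setrlimit(RLIMIT_NOFILE, &rlim) == -1)
	{
		perror("setrlimit");
		return EXIT_FAILURE;
	}
	printf("after: soft=%ld\n", (long) rlim.rlim_cur);

	return EXIT_SUCCESS;
}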
From e84092e3db7712db08dfc5f8824f692a53a1aa9e Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-t...@jeltef.nl>
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v2 2/3] Bump postmaster soft open file limit (RLIMIT_NOFILE)
 when necessary

The default open file limit of 1024 on Linux is extremely low. The
reason this hasn't changed is that raising it would break legacy
programs that use the select(2) system call in hard-to-debug ways. So
instead, programs that want to opt in to a higher open file limit are
expected to bump their soft limit to their hard limit on startup.
Details on this are explained well in a blog post by the systemd
author[1]. A similar change was also done by the Go language[2].

This commit bumps the postmaster's soft open file limit when we realize
that we'll run into the soft limit with the requested
max_files_per_process GUC. We do so by slightly changing the meaning of
the max_files_per_process GUC. The actual (not publicly exposed) limit
is max_safe_fds; previously this would be set to:

max_files_per_process - already_open_files - NUM_RESERVED_FDS

After this change we now try to set max_safe_fds to
max_files_per_process if the system allows that. This is deemed more
natural for users to understand, because now the limit of files that
they can open is actually what they configured in
max_files_per_process.

Adding this infrastructure to change RLIMIT_NOFILE when needed is
especially useful for the AIO work that Andres is doing, because
io_uring consumes a lot of file descriptors. Even without looking at
AIO there is a large number of reports from people who need to change
their soft file limit before starting Postgres, sometimes falling back
to lowering max_files_per_process when they fail to do so[3-8]. It's
also not all that strange to fail at setting the soft open file limit,
because there are multiple places where one can configure such limits
and usually only one of them is effective (which one depends on how
Postgres is started). In cloud environments it's also often not
possible for users to change the soft limit, because they don't control
the way that Postgres is started.

One thing to note is that we temporarily restore the original soft
limit when shelling out to other executables. This is done as a
precaution in case those executables are using select(2).
[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 doc/src/sgml/config.sgml                 |  17 ++--
 src/backend/access/transam/xlogarchive.c |   4 +
 src/backend/archive/shell_archive.c      |   2 +
 src/backend/storage/file/fd.c            | 111 ++++++++++++++++++++---
 src/include/storage/fd.h                 |   2 +
 5 files changed, 116 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5e4f201e099..4886a02d47e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2352,15 +2352,14 @@ include_dir 'conf.d'
       </term>
       <listitem>
        <para>
-        Sets the maximum number of simultaneously open files allowed to each
-        server subprocess. The default is one thousand files. If the kernel is enforcing
-        a safe per-process limit, you don't need to worry about this setting.
-        But on some platforms (notably, most BSD systems), the kernel will
-        allow individual processes to open many more files than the system
-        can actually support if many processes all try to open
-        that many files.  If you find yourself seeing <quote>Too many open
-        files</quote> failures, try reducing this setting.
-        This parameter can only be set at server start.
+        Sets the maximum number of files that each server subprocess is allowed
+        to open. The default is one thousand files. If the kernel is enforcing
+        a lower soft limit, Postgres will automatically increase it if the hard
+        limit allows that. Setting this value too high can cause resource
+        exhaustion problems: on some platforms (notably, most BSD systems), the
+        kernel will allow individual processes to open many more files than the
+        system can actually support if many processes all try to open that many
+        files. This parameter can only be set at server start.
        </para>
       </listitem>
      </varlistentry>

diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 1ef1713c91a..bf17426039e 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -158,6 +158,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 			(errmsg_internal("executing restore command \"%s\"",
 							 xlogRestoreCmd)));
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(WAIT_EVENT_RESTORE_COMMAND);
 
@@ -180,6 +181,7 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	pgstat_report_wait_end();
 
 	pfree(xlogRestoreCmd);
+	RestoreCustomOpenFileLimit();
 
 	if (rc == 0)
 	{
@@ -325,10 +327,12 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
 	/*
 	 * execute the constructed command
 	 */
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(wait_event_info);
 	rc = system(xlogRecoveryCmd);
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 
 	pfree(xlogRecoveryCmd);
 
diff --git a/src/backend/archive/shell_archive.c b/src/backend/archive/shell_archive.c
index 828723afe47..c16f354a118 100644
--- a/src/backend/archive/shell_archive.c
+++ b/src/backend/archive/shell_archive.c
@@ -75,10 +75,12 @@ shell_archive_file(ArchiveModuleState *state, const char *file,
 			(errmsg_internal("executing archive command \"%s\"",
 							 xlogarchcmd)));
 
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pgstat_report_wait_start(WAIT_EVENT_ARCHIVE_COMMAND);
 	rc = system(xlogarchcmd);
 	pgstat_report_wait_end();
+	RestoreCustomOpenFileLimit();
 
 	if (rc != 0)
 	{
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index e454db4c020..04fb93be56d 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -157,6 +157,12 @@ int			max_files_per_process = 1000;
  */
 int			max_safe_fds = FD_MINFREE;	/* default if not changed */
 
+#ifdef HAVE_GETRLIMIT
+static struct rlimit original_max_open_files;
+static struct rlimit custom_max_open_files;
+#endif
+
+
 /* Whether it is safe to continue running after fsync() fails. */
 bool		data_sync_retry = false;
 
@@ -945,6 +951,82 @@ InitTemporaryFileAccess(void)
 #endif
 }
 
+#ifdef HAVE_GETRLIMIT
+/*
+ * Increases the open file limit (RLIMIT_NOFILE) by the requested amount.
+ * Returns true if successful, false otherwise.
+ */
+static bool
+IncreaseOpenFileLimit(int extra_files)
+{
+	struct rlimit rlim = custom_max_open_files;
+
+	/* If we're already at the hard limit, we can't increase it further */
+	if (rlim.rlim_cur == original_max_open_files.rlim_max)
+		return false;
+
+	/* Otherwise try to increase the soft limit to what we need */
+	rlim.rlim_cur = Min(rlim.rlim_cur + extra_files, rlim.rlim_max);
+
+	if (setrlimit(RLIMIT_NOFILE, &rlim) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+		return false;
+	}
+
+	custom_max_open_files = rlim;
+
+	elog(LOG, "increased open file limit to %ld", (long) rlim.rlim_cur);
+
+	return true;
+}
+#endif
+
+/*
+ * RestoreOriginalOpenFileLimit --- Restore the original open file limit that
+ * was present at postmaster start.
+ *
+ * This should be called before spawning subprocesses that might use select(2),
+ * which can only handle file descriptors below FD_SETSIZE (usually 1024).
+ */
+void
+RestoreOriginalOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &original_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
+/*
+ * RestoreCustomOpenFileLimit --- Restores our custom open file limit after a
+ * previous call to RestoreOriginalOpenFileLimit.
+ */
+void
+RestoreCustomOpenFileLimit(void)
+{
+#ifdef HAVE_GETRLIMIT
+	if (custom_max_open_files.rlim_cur == original_max_open_files.rlim_cur)
+	{
+		/* Not changed, so no need to call setrlimit at all */
+		return;
+	}
+
+	if (setrlimit(RLIMIT_NOFILE, &custom_max_open_files) != 0)
+	{
+		ereport(WARNING, (errmsg("setrlimit failed: %m")));
+	}
+#endif
+}
+
 /*
  * count_usable_fds --- count how many FDs the system will let us open,
  *		and estimate how many are already open.
@@ -969,7 +1051,6 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	int			j;
 
 #ifdef HAVE_GETRLIMIT
-	struct rlimit rlim;
 	int			getrlimit_status;
 #endif
 
@@ -977,9 +1058,11 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 	fd = (int *) palloc(size * sizeof(int));
 
 #ifdef HAVE_GETRLIMIT
-	getrlimit_status = getrlimit(RLIMIT_NOFILE, &rlim);
+	getrlimit_status = getrlimit(RLIMIT_NOFILE, &original_max_open_files);
 	if (getrlimit_status != 0)
 		ereport(WARNING, (errmsg("getrlimit failed: %m")));
+	else
+		custom_max_open_files = original_max_open_files;
 #endif							/* HAVE_GETRLIMIT */
 
 	/* dup until failure or probe limit reached */
@@ -993,13 +1076,21 @@ count_usable_fds(int max_to_probe, int *usable_fds, int *already_open)
 		 * don't go beyond RLIMIT_NOFILE; causes irritating kernel logs on
 		 * some platforms
 		 */
-		if (getrlimit_status == 0 && highestfd >= rlim.rlim_cur - 1)
-			break;
+		if (getrlimit_status == 0 && highestfd >= custom_max_open_files.rlim_cur - 1)
+		{
+			if (!IncreaseOpenFileLimit(max_to_probe - used))
+				break;
+		}
 #endif
 
 		thisfd = dup(2);
 		if (thisfd < 0)
 		{
+#ifdef HAVE_GETRLIMIT
+			if (errno == EMFILE && IncreaseOpenFileLimit(max_to_probe - used))
+				continue;
+#endif
+
 			/* Expect EMFILE or ENFILE, else it's fishy */
 			if (errno != EMFILE && errno != ENFILE)
 				elog(WARNING, "duplicating stderr file descriptor failed after %d successes: %m", used);
@@ -1053,15 +1144,10 @@ set_max_safe_fds(void)
 	 * or the experimentally-determined EMFILE limit.
 	 *----------
 	 */
-	count_usable_fds(max_files_per_process,
+	count_usable_fds(max_files_per_process + NUM_RESERVED_FDS,
 					 &usable_fds, &already_open);
 
-	max_safe_fds = Min(usable_fds, max_files_per_process - already_open);
-
-	/*
-	 * Take off the FDs reserved for system() etc.
-	 */
-	max_safe_fds -= NUM_RESERVED_FDS;
+	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
 	/*
 	 * Make sure we still have enough to get by.
@@ -1078,6 +1164,7 @@ set_max_safe_fds(void)
 		   max_safe_fds, usable_fds, already_open);
 }
 
+
 /*
  * Open a file with BasicOpenFilePerm() and pass default file mode for the
  * fileMode parameter.
@@ -2724,6 +2811,7 @@ OpenPipeStream(const char *command, const char *mode)
 	ReleaseLruFiles();
 
 TryAgain:
+	RestoreOriginalOpenFileLimit();
 	fflush(NULL);
 	pqsignal(SIGPIPE, SIG_DFL);
 	errno = 0;
@@ -2731,6 +2819,7 @@ TryAgain:
 	save_errno = errno;
 	pqsignal(SIGPIPE, SIG_IGN);
 	errno = save_errno;
+	RestoreCustomOpenFileLimit();
 
 	if (file != NULL)
 	{
 		AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e3067ab6597..d24e7f1c8df 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -177,6 +177,8 @@ extern void RemovePgTempFiles(void);
 extern void RemovePgTempFilesInDir(const char *tmpdirname, bool missing_ok,
 								   bool unlink_all);
 extern bool looks_like_temp_rel_name(const char *name);
+extern void RestoreOriginalOpenFileLimit(void);
+extern void RestoreCustomOpenFileLimit(void);
 
 extern int	pg_fsync(int fd);
 extern int	pg_fsync_no_writethrough(int fd);
-- 
2.43.0
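To make the max_safe_fds change concrete with some example numbers:
with the default max_files_per_process = 1000, NUM_RESERVED_FDS = 10,
and (say) 20 fds already open at postmaster start, the old code
computed max_safe_fds = Min(usable_fds, 1000 - 20) - 10 = 970
(assuming the probe found enough usable fds). With this patch we
instead probe for 1000 + 10 usable fds, raising the soft limit along
the way if the hard limit permits, and end up with
max_safe_fds = Min(usable_fds - 10, 1000) = 1000, i.e. exactly the
configured value.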
From 44c196025cbf57a09f6aa1bc70646ed2537d9654 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <github-t...@jeltef.nl>
Date: Wed, 12 Feb 2025 01:08:07 +0100
Subject: [PATCH v2 3/3] Reflect the value of max_safe_fds in
 max_files_per_process

It is currently hard to figure out if max_safe_fds is significantly
lower than max_files_per_process. This commit reflects the value of
max_safe_fds in max_files_per_process after our limit detection has
run. We still want two separate variables, because their values
intentionally differ in the bootstrap and standalone-backend cases.
---
 src/backend/storage/file/fd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 04fb93be56d..05c0792e4e1 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1149,6 +1149,9 @@ set_max_safe_fds(void)
 
 	max_safe_fds = Min(usable_fds - NUM_RESERVED_FDS, max_files_per_process);
 
+	/* Update GUC variable to allow users to see the result */
+	max_files_per_process = max_safe_fds;
+
 	/*
 	 * Make sure we still have enough to get by.
 	 */
-- 
2.43.0
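One way to verify the effect of this third patch: after startup,
SHOW max_files_per_process (or a look at pg_settings) should report
the effective limit rather than the configured one, so seeing a lower
value than what's in postgresql.conf is now the signal that limit
detection clamped it.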