It turns out that the Slurm job limits are *not* controlled by the normal
/etc/security/limits.conf configuration. Any service running under
Systemd (such as slurmd) has its limits defined by Systemd; see [1] and [2].
The limits of processes started by slurmd are defined by LimitXXX in
/usr/lib/systemd/system/slurmd.service, and current Slurm versions have
LimitNOFILE=131072.
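For reference, the limit slurmd actually runs with can be inspected with
systemctl, and it can be raised with a drop-in instead of editing the
packaged unit file. A minimal sketch (the drop-in file name and the value
262144 are only examples):

$ systemctl show -p LimitNOFILE slurmd
LimitNOFILE=131072

# /etc/systemd/system/slurmd.service.d/limits.conf
[Service]
LimitNOFILE=262144

$ systemctl daemon-reload && systemctl restart slurmd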
I guess that LimitNOFILE is the limit applied to every Slurm job, and
that a job would presumably fail if it tried to open more than
LimitNOFILE files?
If this is correct, I think the kernel's fs.file-max ought to be set to
131072 times the maximum possible number of Slurm jobs per node, plus a
safety margin for the OS. Since the number of jobs per node is (depending
on the Slurm configuration) bounded by the number of CPUs, this amounts to
131072 times the CPU count plus some extra margin. For example, a 96-core
node might have fs.file-max set to 100*131072 = 13107200.
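Concretely, that would translate into a sysctl drop-in along these lines
(the factor of 100 is just 96 cores plus a little headroom, and the file
name is arbitrary):

# /etc/sysctl.d/90-file-max.conf
# LimitNOFILE (131072) * ~100 possible jobs/processes per node + margin
fs.file-max = 13107200

$ sysctl --system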
Does this make sense?
Best regards,
Ole
[1] "How to set limits for services in RHEL and systemd"
https://access.redhat.com/solutions/1257953
[2]
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#slurmd-systemd-limits
On 4/18/24 11:23, Ole Holm Nielsen wrote:
I looked at some of our busy 96-core nodes where users are currently
running the STAR-CCM+ CFD software.
One job runs on four 96-core nodes. I'm amazed that each STAR-CCM+
process has almost 1000 open files, for example:
$ lsof -p 440938 | wc -l
950
and that on this node the user has almost 95000 open files:
$ lsof -u <username> | wc -l
94606
So it's no wonder that a limit of 65536 open files would have been
exhausted, and that my current limit is just barely sufficient:
$ sysctl fs.file-max
fs.file-max = 131072
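By the way, a cheaper node-wide cross-check than lsof is the kernel's own
counter, which prints the number of allocated file handles, the number of
free handles, and the fs.file-max limit:

$ cat /proc/sys/fs/file-nr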
As an experiment I lowered the maximum number of open files on a node:
$ sysctl fs.file-max=32768
and immediately the syslog showed error messages:
Apr 18 10:54:11 e033 kernel: VFS: file-max limit 32768 reached
Munged (version 0.5.16) logged a lot of errors:
2024-04-18 10:54:33 +0200 Info: Failed to accept connection: Too many
open files in system
2024-04-18 10:55:34 +0200 Info: Failed to accept connection: Too many
open files in system
2024-04-18 10:56:35 +0200 Info: Failed to accept connection: Too many
open files in system
2024-04-18 10:57:22 +0200 Info: Encode retry #1 for client UID=0 GID=0
2024-04-18 10:57:22 +0200 Info: Failed to send message: Broken pipe
(many lines deleted)
Slurmd also logged some errors:
[2024-04-18T10:57:22.070] error: slurm_send_node_msg: [(null)]
slurm_bufs_sendto(msg_type=RESPONSE_ACCT_GATHER_UPDATE) failed: Unexpected
missing socket error
[2024-04-18T10:57:22.080] error: slurm_send_node_msg: [(null)]
slurm_bufs_sendto(msg_type=RESPONSE_PING_SLURMD) failed: Unexpected
missing socket error
[2024-04-18T10:57:22.080] error: slurm_send_node_msg: [(null)]
slurm_bufs_sendto(msg_type=RESPONSE_PING_SLURMD) failed: Unexpected
missing socket error
The node became completely non-responsive until I restored
fs.file-max=131072.
Conclusions:
1. Munge should be upgraded to 0.5.15 or later to avoid the munged.log
filling up the disk. I summarize this in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#munge-authentication-service
2. We still need a heuristic for determining a sufficient value for the
kernel's fs.file-max limit. I don't understand whether the kernel itself
sets a good default value on its own; we have noticed such defaults on
some servers and login nodes.
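If I read the kernel sources correctly (an assumption on my part, not
something I have verified in detail), the built-in default for
fs.file-max is derived from the amount of RAM, roughly one file handle
per 10 kB of memory. That can be sanity-checked against the value in
effect:

$ awk '/MemTotal/ {print int($2/10)}' /proc/meminfo   # approx. default
$ sysctl -n fs.file-max                               # value in effect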
As Jeffrey points out, there are both soft and hard limits on the number
of open files, and this is what I see for a normal user:
$ ulimit -Sn # Soft limit
1024
$ ulimit -Hn # Hard limit
262144
Maybe the heuristic could be to multiply "ulimit -Hn" by the CPU core
count (if we believe that users will run only 1 process per core), with
an extra safety margin added on top. Or maybe we need something a lot
higher?
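In shell that heuristic would look something like this (the factor of 2
is an arbitrary safety margin):

$ echo $(( $(ulimit -Hn) * $(nproc) * 2 ))
50331648

i.e. about 50 million on a 96-core node with a hard limit of 262144.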
Question: Would there be any negative side effect of setting fs.file-max
to a very large number (10s of millions)?
Interestingly, the (possibly outdated) Large Cluster Administration Guide
at https://slurm.schedmd.com/big_sys.html recommends a ridiculously low
number:
/proc/sys/fs/file-max: The maximum number of concurrently open files. We
recommend a limit of at least 32,832.
Thanks for sharing your insights,
Ole
On 4/16/24 14:40, Jeffrey T Frey via slurm-users wrote:
> AFAIK, the fs.file-max limit is a node-wide limit, whereas "ulimit -n"
> is per user.
The ulimit is a frontend to rusage limits, which are per-process
restrictions (not per-user).
The fs.file-max is the kernel's limit on how many file descriptors can
be open in aggregate. You'd have to edit that with sysctl:
$ sysctl fs.file-max
fs.file-max = 26161449
Check e.g. /etc/sysctl.conf or /etc/sysctl.d to see whether an
alternative limit has been configured versus the default.
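A quick way to look for such an override would be something like:

$ grep -rs file-max /etc/sysctl.conf /etc/sysctl.d/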
But if you have ulimit -n == 1024, then no user should be able to hit
the fs.file-max limit, even if it is 65536. (Technically, 96 jobs from
96 users each trying to open 1024 files would do it, though.)
Naturally, since the ulimit is per-process, equating the multiplier with
the core count isn't valid. It also assumes Slurm isn't set up to
oversubscribe CPU resources :-)
> I'm not sure how the number 3092846 got set, since it's not defined in
> /etc/security/limits.conf. The "ulimit -u" varies quite a bit among
> our compute nodes, so which dynamic service might affect the limits?
If the 1024 is a soft limit, you may have users who are raising it to
arbitrary values themselves, especially since 1024 is somewhat low for
the more naively written data science Python code I see
on our systems. If Slurm is configured to propagate submission shell
ulimits to the runtime environment and you allow submission from a
variety of nodes/systems you could be seeing myriad limits reconstituted
on the compute node despite the /etc/security/limits.conf settings.
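If that propagation turns out to be the culprit, one option (just a
sketch, depending on your site policy) is to tell slurm.conf not to
propagate the NOFILE limit, so that tasks get slurmd's own limit instead:

# slurm.conf
PropagateResourceLimitsExcept=NOFILE

The value a job actually ends up with can be checked with e.g.:

$ srun sh -c 'ulimit -Sn; ulimit -Hn'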
The main question needing an answer is _what_ process(es) are opening
all the files on your systems that are faltering. It's very likely to
be the user jobs opening all of them; I was just hoping to also rule out
any bug in munged. Since you're upgrading munged, you'll now get the
errno associated with the backlog and can confirm EMFILE vs. ENFILE
vs. ENOMEM.
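A quick way to see which processes hold the descriptors (run as root on
an affected node) would be something along these lines:

$ for p in /proc/[0-9]*; do echo "$(ls $p/fd 2>/dev/null | wc -l) $(cat $p/comm 2>/dev/null)"; done | sort -rn | head

which prints each process's file-descriptor count next to its command
name, largest first.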
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com