We have some new AMD EPYC compute nodes with 96 cores/node running
RockyLinux 8.9. We've had a number of incidents where the Munge log file
/var/log/munge/munged.log suddenly grows until it fills the root file
system to 100% (tens of GB), and the node eventually grinds to a
halt! Wiping munged.log and restarting the node works around the issue.
I've tried to track down the symptoms, and this is what I found:
1. munged.log fills up the disk with an endless stream of lines like:
2024-04-11 09:59:29 +0200 Info: Suspended new connections while
processing backlog
2. slurmd gets no responses from munged, even though we run
"munged --num-threads 10". slurmd.log shows errors like:
[2024-04-12T02:05:45.001] error: If munged is up, restart with
--num-threads=10
[2024-04-12T02:05:45.001] error: Munge encode failed: Failed to
connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
[2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg:
auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error
3. /var/log/messages shows the slurmd errors, as well as NetworkManager
reporting "Too many open files in system".
The telltale syslog entry seems to be:
Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached
where the limit is confirmed in /proc/sys/fs/file-max (see the diagnostic
commands below).
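For anyone hitting the same symptoms, these are the commands I would run
on an affected node to confirm that munged still responds and that the
kernel's file-handle table is exhausted (standard tools only; nothing here
is specific to our setup):

# Round-trip test of munged: encode a credential and decode it again
munge -n | unmunge

# System-wide handle usage: allocated handles, free handles, and the limit
cat /proc/sys/fs/file-nr
sysctl fs.file-max

# Per-process open-file limit of the munged daemon (separate from fs.file-max)
grep 'open files' /proc/$(pidof munged)/limits

When file-max has been reached, the first number in file-nr will be at or
near the third.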
We have never seen such errors from Munge before. The errors may be
triggered by certain user codes (possibly star-ccm+) that open far more
files on the 96-core nodes than on nodes with a lower core count.
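To check which processes are actually holding the file descriptors, a
rough sketch (run as root so every /proc/<pid>/fd is readable; process
names are taken from /proc/<pid>/comm):

# Count open fds per process and list the 20 biggest consumers
for p in /proc/[0-9]*; do
    printf '%s %s\n' "$(ls "$p/fd" 2>/dev/null | wc -l)" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -20

Note that this counts per-process descriptors, which is a reasonable proxy
for the kernel's open-file handles but not exactly the same number that
file-nr reports.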
My workaround has been to set this line in /etc/sysctl.conf:
fs.file-max = 131072
and apply the setting with "sysctl -p". We haven't seen any of the Munge
errors since!
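The same workaround can also be deployed as a drop-in file instead of
editing /etc/sysctl.conf, which is how I would push it out with a config
management tool; the file name below is arbitrary, and 131072 is simply
the value I picked, not a tuned recommendation:

# /etc/sysctl.d/90-file-max.conf
fs.file-max = 131072

# Apply without a reboot and verify:
sysctl --system
sysctl fs.file-max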
The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer
version at https://github.com/dun/munge/releases/tag/munge-0.5.16
I can't tell whether 0.5.16 contains a fix for the issue seen here.
Questions: Have other sites seen this Munge issue as well? Are there any
good recommendations for setting the fs.file-max parameter on Slurm
compute nodes?
Thanks for sharing your insights,
Ole
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark