We have some new AMD EPYC compute nodes with 96 cores/node running
RockyLinux 8.9. We've had a number of incidents where the Munge log file
/var/log/munge/munged.log suddenly grows until it fills the root file
system to 100% (tens of GB), and the node eventually grinds to a
halt! Wiping munged.log and restarting the node works around the issue.
I've tried to track down the symptoms, and this is what I found:
1. munged.log fills up the disk with an endless stream of lines like:
2024-04-11 09:59:29 +0200 Info: Suspended new connections while
processing backlog
2. slurmd gets no responses from munged, even though we run
"munged --num-threads 10". slurmd.log shows errors like:
[2024-04-12T02:05:45.001] error: If munged is up, restart with
--num-threads=10
[2024-04-12T02:05:45.001] error: Munge encode failed: Failed to
connect to "/var/run/munge/munge.socket.2": Resource temporarily unavailable
[2024-04-12T02:05:45.001] error: slurm_buffers_pack_msg:
auth_g_create: RESPONSE_ACCT_GATHER_UPDATE has authentication error
3. /var/log/messages shows the slurmd errors, as well as NetworkManager
reporting "Too many open files in system".
The telltale syslog entry seems to be:
Apr 12 02:05:48 e009 kernel: VFS: file-max limit 65536 reached
where the limit is confirmed in /proc/sys/fs/file-max (see the diagnostic
commands below).
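For anyone hitting the same symptoms, these are the commands I would run
on an affected node to confirm that munged still responds and that the
kernel's file-handle table is exhausted (standard tools only; nothing here
is specific to our setup):

# Round-trip test of munged: encode a credential and decode it again
munge -n | unmunge

# System-wide handle usage: allocated handles, free handles, and the limit
cat /proc/sys/fs/file-nr
sysctl fs.file-max

# Per-process open-file limit of the munged daemon (separate from fs.file-max)
grep 'open files' /proc/$(pidof munged)/limits

When file-max has been reached, the first number in file-nr will be at or
near the third.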
We have never seen such errors from Munge before. The errors may be
triggered by certain user codes (possibly star-ccm+) that open far more
files on the 96-core nodes than on nodes with a lower core count.
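To check which processes are actually holding the file descriptors, a
rough sketch (run as root so every /proc/<pid>/fd is readable; process
names are taken from /proc/<pid>/comm):

# Count open fds per process and list the 20 biggest consumers
for p in /proc/[0-9]*; do
    printf '%s %s\n' "$(ls "$p/fd" 2>/dev/null | wc -l)" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -20

Note that this counts per-process descriptors, which is a reasonable proxy
for the kernel's open-file handles but not exactly the same number that
file-nr reports.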
My workaround has been to set this line in /etc/sysctl.conf:
fs.file-max = 131072
and apply the setting with "sysctl -p". We haven't seen any of the Munge
errors since!
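The same workaround can also be deployed as a drop-in file instead of
editing /etc/sysctl.conf, which is how I would push it out with a config
management tool; the file name below is arbitrary, and 131072 is simply
the value I picked, not a tuned recommendation:

# /etc/sysctl.d/90-file-max.conf
fs.file-max = 131072

# Apply without a reboot and verify:
sysctl --system
sysctl fs.file-max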
The version of Munge in RockyLinux 8.9 is 0.5.13, but there is a newer
version at https://github.com/dun/munge/releases/tag/munge-0.5.16
I can't tell whether 0.5.16 contains a fix for the issue seen here.
Questions: Have other sites seen this Munge issue as well? Are there any
good recommendations for setting the fs.file-max parameter on Slurm
compute nodes?
Thanks for sharing your insights,
Ole
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark