Do you have a firewall between the slurmd and slurmctld daemons?  If so, do you 
know what idle timeout that firewall uses for expiring idle sessions?  I ran 
into something somewhat similar, but for me it was between slurmctld and 
slurmdbd: a recent change left one direction of that connection idle unless 
certain operations occurred, and we had a firewall device between them that was 
expiring those idle sessions.  In our case 23.11.1 brought a fix for that 
specific issue.  I never had issues between slurmctld and slurmd, though our 
firewall does not sit between those two layers.
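
If you want to compare numbers on your side, a quick sketch (assuming Linux 
hosts and the default slurmctld port 6817) is to look at the kernel keepalive 
settings and the timers on the established Slurm connections, then compare 
those against the firewall's idle-session timeout:

# Kernel TCP keepalive settings (default is typically 7200s before the first probe)
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes

# Established connections involving slurmctld, with their keepalive/retransmit timers
ss -o -tan '( sport = :6817 or dport = :6817 )'

If the firewall drops idle sessions well before the first keepalive probe would 
be sent, one direction of the connection can silently go away, which is 
essentially what happened to us between slurmctld and slurmdbd.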

--
Brian D. Haymore
University of Utah
Center for High Performance Computing
155 South 1452 East RM 405
Salt Lake City, Ut 84112
Phone: 801-558-1150
http://bit.ly/1HO1N2C
________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Fokke 
Dijkstra <f.dijks...@rug.nl>
Sent: Tuesday, January 23, 2024 4:00 AM
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Issues with Slurm 23.11.1

Dear all,

Since the upgrade from Slurm 22.05 to 23.11.1 we have been having problems with 
the communication between the slurmctld and slurmd processes.
We run a cluster of 183 nodes and almost 19,000 cores. Unfortunately, some 
nodes are on a different network, which prevents full internode communication. 
A network topology and TopologyParam=RouteTree are used to make sure that no 
slurmd communication happens between nodes on different networks.
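
(For context, this is a plain topology/tree setup; the snippet below only 
illustrates the kind of configuration we use, with made-up switch and node 
names:)

# slurm.conf
TopologyPlugin=topology/tree
TopologyParam=RouteTree

# topology.conf (illustrative switch/node names, not our real layout)
SwitchName=net1 Nodes=node[001-100]
SwitchName=net2 Nodes=node[101-183]
SwitchName=root Switches=net1,net2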

In the new Slurm version we see the following issues, which did not appear in 
22.05:

1. slurmd processes accumulate many network connections in CLOSE-WAIT (or 
CLOSE_WAIT, depending on the tool used), which causes the processes to hang 
when we try to restart slurmd.

When checking for connections in CLOSE-WAIT we see the following behaviour:
Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
1      0          10.5.2.40:6818      10.5.0.43:58572 users:(("slurmd",pid=1930095,fd=72),("slurmd",pid=1930067,fd=72))
1      0          10.5.2.40:6818      10.5.0.43:58284 users:(("slurmd",pid=1930095,fd=8),("slurmd",pid=1930067,fd=8))
1      0          10.5.2.40:6818      10.5.0.43:58186 users:(("slurmd",pid=1930095,fd=22),("slurmd",pid=1930067,fd=22))
1      0          10.5.2.40:6818      10.5.0.43:58592 users:(("slurmd",pid=1930095,fd=76),("slurmd",pid=1930067,fd=76))
1      0          10.5.2.40:6818      10.5.0.43:58338 users:(("slurmd",pid=1930095,fd=19),("slurmd",pid=1930067,fd=19))
1      0          10.5.2.40:6818      10.5.0.43:58568 users:(("slurmd",pid=1930095,fd=68),("slurmd",pid=1930067,fd=68))
1      0          10.5.2.40:6818      10.5.0.43:58472 users:(("slurmd",pid=1930095,fd=69),("slurmd",pid=1930067,fd=69))
1      0          10.5.2.40:6818      10.5.0.43:58486 users:(("slurmd",pid=1930095,fd=38),("slurmd",pid=1930067,fd=38))
1      0          10.5.2.40:6818      10.5.0.43:58316 users:(("slurmd",pid=1930095,fd=29),("slurmd",pid=1930067,fd=29))

The first IP address is that of the compute node, the second that of the node 
running slurmctld. The nodes can communicate using these IP addresses just fine.

2. slurmd cannot be properly restarted
[2024-01-18T10:45:26.589] slurmd version 23.11.1 started
[2024-01-18T10:45:26.593] error: Error binding slurm stream socket: Address already in use
[2024-01-18T10:45:26.593] fatal: Unable to bind listen port (6818): Address already in use

This is probably because the old slurmd processes still hold the sockets in 
CLOSE-WAIT and therefore the listen port; those processes can only be killed 
with signal 9 (kill -9).
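
At the moment the only way we can get slurmd running again is to clean up by 
hand, roughly as follows (a sketch; this assumes a systemd-managed slurmd and 
the default port 6818):

# See which processes still hold sockets on the slurmd port
ss -tanp '( sport = :6818 )'

# The stuck slurmd processes ignore SIGTERM, so force-kill them and restart the service
pkill -9 -x slurmd
systemctl restart slurmd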

3. We see jobs stuck in the completing (CG) state, probably due to 
communication issues between slurmctld and slurmd. The slurmctld sends repeated 
kill requests, but these do not seem to be acknowledged by the slurmd on the 
node. This happens more often with large job arrays, or generally when many 
jobs start at the same time. However, this could just be a biased observation 
(i.e., it is more noticeable on large job arrays simply because there are more 
jobs that can fail in the first place).
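
A rough way to see which jobs and nodes are affected (standard squeue/scontrol 
options, nothing version-specific as far as I know):

# Jobs stuck in the completing (CG) state and the nodes they still occupy
squeue --states=CG --format="%.12i %.10T %.12M %R"

# Inspect one of the listed nodes (replace <nodename>)
scontrol show node <nodename>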

4. Since the upgrade we also see messages like:
[2024-01-17T09:58:48.589] error: Failed to kill program loading user environment
[2024-01-17T09:58:48.590] error: Failed to load current user environment variables
[2024-01-17T09:58:48.590] error: _get_user_env: Unable to get user's local environment, running only with passed environment
As a result, users' jobs run with the wrong environment and cannot load the 
modules for the software they need, which leads to many job failures.
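
As far as I understand, slurmd builds this environment by spawning a login 
shell for the user (roughly "su - <user> -c env") with an internal timeout. A 
crude way to check whether that step is simply slow on a node is the sketch 
below; <username> is a placeholder, and this only approximates what 
_get_user_env does internally:

# Time how long it takes to build a login environment for a user on this node
time su - <username> -c env > /dev/null

If this is slow (slow shell start-up files, an unresponsive home filesystem), 
the load could hit the timeout and produce exactly these errors; whether that 
is the trigger here, or just another symptom of the communication problems, is 
not clear to us.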

The issue appears to be somewhat similar to the one described at: 
https://bugs.schedmd.com/show_bug.cgi?id=18561
In that case the site downgraded the slurmd clients to 22.05, which got rid of 
the problems.
We have now downgraded slurmd on our compute nodes to 23.02.7, which also seems 
to work around the issue.
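
In case someone wants to do the same: we keep slurmd pinned at the older 
version on the compute nodes while slurmctld stays at 23.11.1. Package names 
and the pinning mechanism depend on your distribution and on how Slurm was 
packaged, so the lines below are only an illustration:

# RPM-based systems (requires the dnf versionlock plugin; package name may differ per build)
dnf versionlock add 'slurm-slurmd-23.02.7*'

# Debian/Ubuntu equivalent
apt-mark hold slurmd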

Does anyone know of a better solution?

Kind regards,

Fokke Dijkstra

--
Fokke Dijkstra <f.dijks...@rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA  Groningen, The Netherlands