This is a quick update on the status. Upgrading to Slurm 23.11.4 fixed the
issue. It appears we were bitten by the following bug:

 -- Fix stuck processes and incorrect environment when using --get-user-env.

This was triggered for us because we had set SBATCH_EXPORT=NONE for our
users.
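For reference, a minimal sketch of the combination that exercises this code
path (the job script name is just illustrative): with SBATCH_EXPORT=NONE the
submission environment is not forwarded to the job, and the user's login
environment is instead loaded on the compute node, i.e. roughly the
--get-user-env path that 23.11.4 fixes.

# Site-wide default in the users' shell profiles:
export SBATCH_EXPORT=NONE

# The submission shell environment is then not exported to the job; the
# user's login environment is rebuilt on the compute node instead
# (the --get-user-env code path):
sbatch job.sh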
Kind regards,

Fokke

On Wed 24 Jan 2024 at 16:19, Fokke Dijkstra <f.dijks...@rug.nl> wrote:

> Dear Brian,
>
> Thanks for the hints, I think you are correctly pointing at some network
> connection issue. I've disabled firewalld on the control host, but that
> unfortunately did not help. The processes stuck in CLOSE-WAIT do indeed
> suggest that network connections are not properly terminated.
> I've tried to adjust some network settings based on the suggestions for
> high-throughput environments, but this didn't help either.
> I've also updated to 23.11.2, which did not fix the problem either.
> I am able to reproduce the issue by swamping a node with array jobs (128
> jobs per node). This results in many of these jobs no longer being
> tracked correctly by the scheduler. Output for these jobs is missing and
> hundreds of slurmd connections are stuck in CLOSE-WAIT.
>
> With slurmd at 23.02.7 and slurmctld and slurmdbd at 23.11.2 the problem
> does not show up.
> BTW I discovered that the job submission hosts must be at the same level
> as the compute nodes. Having the submission hosts still at 23.11 resulted
> in startup failures for jobs using Intel MPI. For the record, the error
> message I got was:
> [mpiexec@node24] check_exit_codes
> (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117):
> unable to run bstrap_proxy on node24 (pid 1909692, exit code 65280)
> [mpiexec@node24] poll_for_event
> (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159):
> check exit codes error
> [mpiexec@node24] HYD_dmx_poll_wait_for_proxy_event
> (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212):
> poll for event error
> [mpiexec@node24] HYD_bstrap_setup
> (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061):
> error waiting for event
> [mpiexec@node24] HYD_print_bstrap_setup_error_message
> (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error
> setting up the bootstrap proxies
> [mpiexec@node24] Possible reasons:
> [mpiexec@node24] 1. Host is unavailable. Please check that all hosts are
> available.
> [mpiexec@node24] 2. Cannot launch hydra_bstrap_proxy or it crashed on one
> of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and
> it has right permissions.
> [mpiexec@node24] 3. Firewall refused connection. Check that enough ports
> are allowed in the firewall and specify them with the I_MPI_PORT_RANGE
> variable.
> [mpiexec@node24] 4. slurm bootstrap cannot launch processes on remote
> host. You may try using -bootstrap option to select alternative launcher.
>
> I spent a whole day debugging the issue and could not find any clue as to
> what was causing this. OpenMPI jobs work fine. Finally I discovered that
> making sure the job submission hosts and the compute nodes are at the
> same Slurm level fixes the issue.
>
> For now we'll stay in this configuration, to prevent having to get rid of
> all running and waiting jobs for a slurmctld downgrade. I'll test new
> Slurm 23.11 releases when they appear to see if they fix the issue.
>
> Kind regards,
>
> Fokke
>
> On Tue 23 Jan 2024 at 18:37, Brian Haymore <brian.haym...@utah.edu> wrote:
>
>> Do you have a firewall between the slurmd and the slurmctld daemons? If
>> yes, do you know what kind of idle timeout that firewall has for
>> expiring idle sessions?
>> I ran into something somewhat similar, but for me it was between the
>> slurmctld and slurmdbd, where a recent change they made left one
>> direction between those two daemons idle unless certain operations
>> occurred, and we did have a firewall device between them that was
>> expiring sessions. In our case 23.11.1 brought a fix for that specific
>> issue for us. I never had issues between slurmctld and slurmd (though
>> the firewall is not between those two layers).
>>
>> --
>> Brian D. Haymore
>> University of Utah
>> Center for High Performance Computing
>> 155 South 1452 East RM 405
>> Salt Lake City, UT 84112
>> Phone: 801-558-1150
>> http://bit.ly/1HO1N2C
>> ------------------------------
>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
>> Fokke Dijkstra <f.dijks...@rug.nl>
>> Sent: Tuesday, January 23, 2024 4:00 AM
>> To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
>> Subject: [slurm-users] Issues with Slurm 23.11.1
>>
>> Dear all,
>>
>> Since the upgrade from Slurm 22.05 to 23.11.1 we are having problems
>> with the communication between the slurmctld and slurmd processes.
>> We are running a cluster with 183 nodes and almost 19000 cores.
>> Unfortunately some nodes are in a different network, preventing full
>> internode communication. A network topology and setting TopologyParam
>> RouteTree have been used to make sure no slurmd communication happens
>> between nodes on different networks.
>>
>> In the new Slurm version we see the following issues, which did not
>> appear in 22.05:
>>
>> 1. slurmd processes acquire many network connections in CLOSE-WAIT (or
>> CLOSE_WAIT, depending on the tool used), causing the processes to hang
>> when trying to restart slurmd.
>>
>> When checking for CLOSE-WAIT connections we see the following behaviour:
>>
>> Recv-Q Send-Q Local Address:Port Peer Address:Port Process
>> 1 0 10.5.2.40:6818 10.5.0.43:58572
>> users:(("slurmd",pid=1930095,fd=72),("slurmd",pid=1930067,fd=72))
>> 1 0 10.5.2.40:6818 10.5.0.43:58284
>> users:(("slurmd",pid=1930095,fd=8),("slurmd",pid=1930067,fd=8))
>> 1 0 10.5.2.40:6818 10.5.0.43:58186
>> users:(("slurmd",pid=1930095,fd=22),("slurmd",pid=1930067,fd=22))
>> 1 0 10.5.2.40:6818 10.5.0.43:58592
>> users:(("slurmd",pid=1930095,fd=76),("slurmd",pid=1930067,fd=76))
>> 1 0 10.5.2.40:6818 10.5.0.43:58338
>> users:(("slurmd",pid=1930095,fd=19),("slurmd",pid=1930067,fd=19))
>> 1 0 10.5.2.40:6818 10.5.0.43:58568
>> users:(("slurmd",pid=1930095,fd=68),("slurmd",pid=1930067,fd=68))
>> 1 0 10.5.2.40:6818 10.5.0.43:58472
>> users:(("slurmd",pid=1930095,fd=69),("slurmd",pid=1930067,fd=69))
>> 1 0 10.5.2.40:6818 10.5.0.43:58486
>> users:(("slurmd",pid=1930095,fd=38),("slurmd",pid=1930067,fd=38))
>> 1 0 10.5.2.40:6818 10.5.0.43:58316
>> users:(("slurmd",pid=1930095,fd=29),("slurmd",pid=1930067,fd=29))
>>
>> The first IP address is that of the compute node, the second that of the
>> node running slurmctld. The nodes can communicate using these IP
>> addresses just fine.
>>
>> 2. slurmd cannot be properly restarted:
>> [2024-01-18T10:45:26.589] slurmd version 23.11.1 started
>> [2024-01-18T10:45:26.593] error: Error binding slurm stream socket:
>> Address already in use
>> [2024-01-18T10:45:26.593] fatal: Unable to bind listen port (6818):
>> Address already in use
>>
>> This is probably because of the processes stuck in CLOSE-WAIT, which can
>> only be killed using signal -9.
>> 3. We see jobs stuck in the completing (CG) state, probably due to
>> communication issues between slurmctld and slurmd. The slurmctld sends
>> repeated kill requests, but those do not seem to be acknowledged by the
>> client. This happens more often in large job arrays, or generally when
>> many jobs start at the same time. However, this could be just a biased
>> observation (i.e., it is more noticeable in large job arrays because
>> there are more jobs to fail in the first place).
>>
>> 4. Since the new version we also see messages like:
>> [2024-01-17T09:58:48.589] error: Failed to kill program loading user
>> environment
>> [2024-01-17T09:58:48.590] error: Failed to load current user environment
>> variables
>> [2024-01-17T09:58:48.590] error: _get_user_env: Unable to get user's
>> local environment, running only with passed environment
>> The effect of this is that users run with the wrong environment and
>> can't load the modules for the software needed by their jobs. This leads
>> to many job failures.
>>
>> The issue appears to be somewhat similar to the one described at:
>> https://bugs.schedmd.com/show_bug.cgi?id=18561
>> In that case the site downgraded the slurmd clients to 22.05, which got
>> rid of the problems.
>> We've now downgraded slurmd on the compute nodes to 23.02.7, which also
>> seems to work around the issue.
>>
>> Does anyone know of a better solution?
>>
>> Kind regards,
>>
>> Fokke Dijkstra
>>
>> --
>> Fokke Dijkstra <f.dijks...@rug.nl>
>> Team High Performance Computing
>> Center for Information Technology, University of Groningen
>> Postbus 11044, 9700 CA Groningen, The Netherlands
>
> --
> Fokke Dijkstra <f.dijks...@rug.nl>
> Team High Performance Computing
> Center for Information Technology, University of Groningen
> Postbus 11044, 9700 CA Groningen, The Netherlands

--
Fokke Dijkstra <f.dijks...@rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA Groningen, The Netherlands
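A quick way to spot the symptom described above: the stuck CLOSE-WAIT
connections can be listed on an affected compute node along these lines (a
sketch; 6818 is Slurm's default SlurmdPort, so adjust it if your site uses a
different port, and run as root to see the owning process information).

# List slurmd's TCP connections stuck in CLOSE-WAIT, including the owning
# process IDs and file descriptors (same output format as shown above)
ss -tnp state close-wait '( sport = :6818 )'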
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com