This is a quick update on the status. Upgrading to Slurm 23.11.4 fixed the
issue. It appears we were bitten by the following bug:
-- Fix stuck processes and incorrect environment when using --get-user-env
This was triggered for us because we had set SBATCH_EXPORT=NONE for our
users.

Kind regards,

Fokke

Op wo 24 jan 2024 om 16:19 schreef Fokke Dijkstra <f.dijks...@rug.nl>:

> Dear Brian,
>
> Thanks for the hints, I think you are correctly pointing at some network
> connection issue. I've disabled firewalld on the control host, but that
> unfortunately did not help. The processes stuck in CLOSE-WAIT suggest
> indeed that network connections are not properly terminated.
> I've tried to adjust some network settings based on the suggestions for
> high throughput environments, but this didn't help either.
> I've also updated to 23.11.2, which also did not fix the problem.
> I am able to reproduce the issue by swamping a node with array jobs (128
> jobs per node). This results in many of these jobs no longer being tracked
> correctly by the scheduler. Output for these jobs is missing and hundreds
> of slurmd connections are stuck in CLOSE-WAIT.
>
> With slurmd at 23.02.7 and slurmctld and slurmdbd at 23.11.2 the problem
> does not show up.
> BTW I discovered that the job submission hosts must be at the same level
> as the compute nodes. Having the submission hosts still at 23.11 resulted
> in startup failures for jobs using Intel MPI. For the record the error
> message I got whas:
> [mpiexec@node24] check_exit_codes
> (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117):
> unable to run bstrap_proxy on node24 (pid 1909692, exit code 65280)
> [mpiexec@node24] poll_for_event
> (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159):
> check exit codes error
> [mpiexec@node24] HYD_dmx_poll_wait_for_proxy_event
> (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll
> for event error
> [mpiexec@node24] HYD_bstrap_setup
> (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061):
> error waiting for event
> [mpiexec@node24] HYD_print_bstrap_setup_error_message
> (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error
> setting up the bootstrap proxies
> [mpiexec@node24] Possible reasons:
> [mpiexec@node24] 1. Host is unavailable. Please check that all hosts are
> available.
> [mpiexec@node24] 2. Cannot launch hydra_bstrap_proxy or it crashed on one
> of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it
> has right permissions.
> [mpiexec@node24] 3. Firewall refused connection. Check that enough ports
> are allowed in the firewall and specify them with the I_MPI_PORT_RANGE
> variable.
> [mpiexec@node24] 4. slurm bootstrap cannot launch processes on remote
> host. You may try using -bootstrap option to select alternative launcher.
>
> I spent a whole day debugging the issue and could not find any clue what
> was causing this. OpenMPI jobs work fine. Finally I discovered that making
> sure the job submission hosts and the compute nodes are at the same Slurm
> level fixes the issue.
>
> For now we'll stay in this configuration, to prevent having to get rid of
> all running and waiting jobs for a slurmctld downgrade. I'll test new Slurm
> 23.11 releases when they appear to see if that fixes the issue.
>
> Kind regards,
>
> Fokke
>
> Op di 23 jan 2024 om 18:37 schreef Brian Haymore <brian.haym...@utah.edu>:
>
>> Do you have a firewall between the slurmd and the slurmctld daemons?  If
>> yes, do you know what kind of idle timeout that firewall has for expiring
>> idle sessions?  I ran into something somewhat similar but for me it was
>> between the slurmctld and slurmdbd where a recent change they made had one
>> direction between those two daemons left idle unless certain operations
>> occurred and we did have a firewall device between them that was expiring
>> sessions.  In our case 23.11.1 brought a fix for that specific issue for
>> us.  I never had issues between slurmctld and slurmd (though the firewall
>> is not between those two layers).
>>
>> --
>> Brian D. Haymore
>> University of Utah
>> Center for High Performance Computing
>> 155 South 1452 East RM 405
>> Salt Lake City, Ut 84112
>> Phone: 801-558-1150
>> http://bit.ly/1HO1N2C
>> ------------------------------
>> *From:* slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
>> Fokke Dijkstra <f.dijks...@rug.nl>
>> *Sent:* Tuesday, January 23, 2024 4:00 AM
>> *To:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
>> *Subject:* [slurm-users] Issues with Slurm 23.11.1
>>
>> Dear all,
>>
>> Since the upgrade from Slurm 22.05 to 23.11.1 we are having problems with
>> the communication between the slurmctld and slurmd processes.
>> We are running a cluster with 183 nodes and almost 19000 cores.
>> Unfortunately some nodes are in a different network preventing full
>> internode communication. A network topology and setting TopologyParam
>> RouteTree have been used to make sure no slurmd communication happens
>> between nodes on different networks.
>>
>> In the new Slurm version we see the following issues, which did not
>> appear in 22.05:
>>
>> 1. slurmd processes acquire many network connections in CLOSE-WAIT (or
>> CLOSE_WAIT depending on the tool used) causing the processes to hang, when
>> trying to restart slurmd.
>>
>> When checking for CLOSE-WAIT processes we see the following behaviour:
>> Recv-Q Send-Q Local Address:Port  Peer Address:Port Process
>>
>> 1      0          10.5.2.40:6818     10.5.0.43:58572
>> users:(("slurmd",pid=1930095,fd=72),("slurmd",pid=1930067,fd=72))
>> 1      0          10.5.2.40:6818     10.5.0.43:58284
>> users:(("slurmd",pid=1930095,fd=8),("slurmd",pid=1930067,fd=8))
>> 1      0          10.5.2.40:6818     10.5.0.43:58186
>> users:(("slurmd",pid=1930095,fd=22),("slurmd",pid=1930067,fd=22))
>> 1      0          10.5.2.40:6818     10.5.0.43:58592
>> users:(("slurmd",pid=1930095,fd=76),("slurmd",pid=1930067,fd=76))
>> 1      0          10.5.2.40:6818     10.5.0.43:58338
>> users:(("slurmd",pid=1930095,fd=19),("slurmd",pid=1930067,fd=19))
>> 1      0          10.5.2.40:6818     10.5.0.43:58568
>> users:(("slurmd",pid=1930095,fd=68),("slurmd",pid=1930067,fd=68))
>> 1      0          10.5.2.40:6818     10.5.0.43:58472
>> users:(("slurmd",pid=1930095,fd=69),("slurmd",pid=1930067,fd=69))
>> 1      0          10.5.2.40:6818     10.5.0.43:58486
>> users:(("slurmd",pid=1930095,fd=38),("slurmd",pid=1930067,fd=38))
>> 1      0          10.5.2.40:6818     10.5.0.43:58316
>> users:(("slurmd",pid=1930095,fd=29),("slurmd",pid=1930067,fd=29))
>>
>> The first IP address is that of the compute node, the second that of the
>> node running slurmctld. The nodes can communicate using these IP addresses
>> just fine.
>>
>> 2. slurmd cannot be properly restarted
>> [2024-01-18T10:45:26.589] slurmd version 23.11.1 started
>> [2024-01-18T10:45:26.593] error: Error binding slurm stream socket:
>> Address already in use
>> [2024-01-18T10:45:26.593] fatal: Unable to bind listen port (6818):
>> Address already in use
>>
>> This is probably because of the processes being in CLOSE-WAIT, which can
>> only be killed using signal -9.
>>
>> 3. We see jobs stuck in completing CG state, probably due to
>> communication issues between slurmctld and slurmd. The slurmctld sends
>> repeated kill requests but those do not seem to be acknowledged by the
>> client. This happens more often in large job arrays, or generally when many
>> jobs start at the same time. However, this could be just a biased
>> observation (i.e., it is more noticeable on large job arrays because there
>> are more jobs to fail in the first place).
>>
>> 4. Since the new version we also see messages like:
>> [2024-01-17T09:58:48.589] error: Failed to kill program loading user
>> environment
>> [2024-01-17T09:58:48.590] error: Failed to load current user environment
>> variables
>> [2024-01-17T09:58:48.590] error: _get_user_env: Unable to get user's
>> local environment, running only with passed environment
>> The effect of this is that the users run with the wrong environment and
>> can’t load the modules for the software that is needed by their jobs. This
>> leads to many job failures.
>>
>> The issue appears to be somewhat similar to the one described at:
>> https://bugs.schedmd.com/show_bug.cgi?id=18561
>> In that case the site downgraded the slurmd clients to 22.05 which got
>> rid of the problems.
>> We’ve now downgraded the slurmd on the compute nodes to 23.02.7 which
>> also seems to be a workaround for the issue.
>>
>> Does anyone know of a better solution?
>>
>> Kind regards,
>>
>> Fokke Dijkstra
>>
>> --
>> Fokke Dijkstra <f.dijks...@rug.nl> <f.dijks...@rug.nl>
>> Team High Performance Computing
>> Center for Information Technology, University of Groningen
>> Postbus 11044, 9700 CA  Groningen, The Netherlands
>>
>>
>
> --
> Fokke Dijkstra <f.dijks...@rug.nl> <f.dijks...@rug.nl>
> Team High Performance Computing
> Center for Information Technology, University of Groningen
> Postbus 11044, 9700 CA  Groningen, The Netherlands
>
>

-- 
Fokke Dijkstra <f.dijks...@rug.nl> <f.dijks...@rug.nl>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA  Groningen, The Netherlands
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

Reply via email to