On 10/28/22 08:30, Richard Chang wrote:
Yes, the system is an HPE Cray EX, and I am trying to use
switch/hpe_slingshot.
I see that Slurm 22.05 has added support for "switch/hpe_slingshot" with
HPE Slingshot systems:
> SwitchType
> Identifies the type of switch or interconnect used for application
> communications. Acceptable values include "switch/cray_aries" for Cray
> systems, "switch/hpe_slingshot" for HPE Slingshot systems and
> "switch/none" for switches not requiring special processing for job launch
> or termination (Ethernet, and InfiniBand). The default value is
> "switch/none". All Slurm daemons, commands and running jobs must be
> restarted for a change in SwitchType to take effect. If running jobs exist
> at the time slurmctld is restarted with a new value of SwitchType, records
> of all jobs in any state may be lost.
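
To make the setting concrete, here is a minimal sketch of how one might check which SwitchType a slurm.conf actually sets, falling back to the documented default of "switch/none". The helper name read_switch_type and the sample configuration text are hypothetical and only for illustration. The sketch assumes SwitchType sits on its own line and tolerates spaces around "=" as a convenience; it is not part of Slurm itself.

```python
import re

# Default value as stated in the slurm.conf manual page quoted above.
DEFAULT_SWITCH_TYPE = "switch/none"


def read_switch_type(conf_text: str) -> str:
    """Return the SwitchType found in slurm.conf text, or the documented default."""
    value = DEFAULT_SWITCH_TYPE
    for raw_line in conf_text.splitlines():
        # Drop anything after '#', which starts a comment in slurm.conf.
        line = raw_line.split("#", 1)[0].strip()
        # Match the parameter name case-insensitively.
        match = re.match(r"(?i)^SwitchType\s*=\s*(\S+)", line)
        if match:
            value = match.group(1)  # the last occurrence wins in this sketch
    return value


# Hypothetical slurm.conf fragment, for illustration only.
SAMPLE_CONF = """\
# Example fragment
ClusterName=example
SwitchType=switch/hpe_slingshot
"""

print(read_switch_type(SAMPLE_CONF))
```

Running the same check against the slurm.conf on the slurmctld node and on a compute node would show whether both see the same SwitchType value.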
You probably need to contact your HPE support people. A support contract
with SchedMD is highly recommended when you have a complex setup with very
new technology. See https://www.schedmd.com/support.php
/Ole
On 10/28/2022 11:21 AM, Ole Holm Nielsen wrote:
On 10/28/22 07:35, Richard Chang wrote:
I have observed that when I specify a switch type in the slurm.conf
file and that particular switch type is not present on the slurmctld
node, slurmctld panics and shuts down. Is this expected? My slurmctld
node doesn't have the switch type, but the compute nodes do. How can I
set it up so that it can use the feature without breaking Slurm?
What is your line in slurm.conf? The manual page seems to describe what
you have observed:
SwitchType
Identifies the type of switch or interconnect used for application
communications. Acceptable values include "switch/cray_aries" for Cray
systems, "switch/none" for switches not requiring special processing for
job launch or termination (Ethernet, and InfiniBand) and The default value
is "switch/none". All Slurm daemons, commands and running jobs must be
restarted for a change in SwitchType to take effect. If running jobs exist
at the time slurmctld is restarted with a new value of SwitchType, records
of all jobs in any state may be lost.
Why do you want to use this configuration? Is your system a Cray?