Hoping someone can help me pin down the weirdness I’m experiencing.

There are actually two issues I’ve run into: the root issue, and then 
something odd when trying to work around the root issue.

v23.11.10 - Ubuntu 22.04 - slurm-smd debs built from the tarball per 
<https://slurm.schedmd.com/archive/slurm-23.11.10/quickstart_admin.html#debuild>.
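
For reference, the build was just the debuild flow from that page, roughly:

    tar -xaf slurm-23.11.10.tar.bz2
    cd slurm-23.11.10
    debuild -b -uc -us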

I have 2 slurmctld daemons and 1 slurmdbd daemon.
The slurm.conf is consistent across the cluster; the copies on every ctld, dbd, 
and slurmd host have the same shasum hash.
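
What I mean by consistent, concretely (host names here are placeholders):

    # same checksum on both controllers, the dbd host, and a sample node
    for h in ctld1 ctld2 dbd1 gpu-node1; do
        ssh "$h" shasum /etc/slurm/slurm.conf
    done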

The backup slurmctld does not appear to like the gres configured on my gpu 
nodes, despite the primary slurmctld having no issues with it.
When failing over with scontrol takeover, I get the following log messages on 
the secondary slurmctld, where it complains about a typed gres/shard that is 
reported but not configured.

> [2025-02-21T05:02:02.017] error: Setting node $HOST2 state to INVAL with 
> reason:gres/shard type (p100) reported but not configured
> [2025-02-21T05:02:02.018] drain_nodes: node $HOST2 state set to DRAIN
> [2025-02-21T05:02:02.018] error: _slurm_rpc_node_registration node=$HOST2: 
> Invalid argument
> [2025-02-21T05:02:02.020] error: Setting node $HOST3 state to INVAL with 
> reason:gres/shard type (p40) reported but not configured
> [2025-02-21T05:02:02.020] drain_nodes: node $HOST3 state set to DRAIN
> [2025-02-21T05:02:02.020] error: _slurm_rpc_node_registration node=$HOST3: 
> Invalid argument
> [2025-02-21T05:02:02.023] error: Setting node $HOST1 state to INVAL with 
> reason:gres/shard type (t4) reported but not configured
> [2025-02-21T05:02:02.020] drain_nodes: node $HOST1 state set to DRAIN
> [2025-02-21T05:02:02.023] error: _slurm_rpc_node_registration node=$HOST1: 
> Invalid argument
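
(For completeness, the failover is triggered with nothing more exotic than a plain

    scontrol takeover

from the command line.)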

And looking at those hosts in the slurm.conf, the shards are not typed, but 
generic.
> NodeName=$HOST1 [SNIP]   State=UNKNOWN   Gres=gpu:t4:2,shard:8
> NodeName=$HOST2 [SNIP]   State=UNKNOWN   Gres=gpu:p100:2,shard:8   
> NodeName=$HOST3 [SNIP]   State=UNKNOWN   Gres=gpu:p40:2,gpu:p100:2,shard:16   


I should also note at this point that my gres.conf is just AutoDetect=nvml;
I am not explicitly mapping any devices.

So that is problem 1: the two slurmctld daemons behaving differently, where the 
primary has zero issue with the configuration but the secondary consistently 
balks and drains the nodes.
On to problem 2, which I ran into while trying to work around problem 1.

I decided to try adding the typed gres/shards to both AccountingStorageTRES and 
the NodeName lines; the node lines are below, with a sketch of the TRES side 
after them.
> NodeName=$HOST1 [SNIP]   State=UNKNOWN   Gres=gpu:t4:2,shard:t4:8
> NodeName=$HOST2 [SNIP]   State=UNKNOWN   Gres=gpu:p100:2,shard:p100:8   
> NodeName=$HOST3 [SNIP]   State=UNKNOWN   
> Gres=gpu:p40:2,gpu:p100:2,shard:p40:8,shard:p100:8   
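
On the AccountingStorageTRES side, the shape of the change was along these 
lines (treat the exact list as illustrative, not a paste of my config):

    AccountingStorageTRES=gres/gpu,gres/shard,gres/gpu:t4,gres/gpu:p100,gres/gpu:p40,gres/shard:t4,gres/shard:p100,gres/shard:p40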


Now the secondary slurmctld is happy and no longer immediately drains the gpu 
nodes for the invalid gres.
However, the node ($HOST3) with a mixed set of gpus is now complaining.
> [2025-02-23T21:41:05.311] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system 
> device(s) detected
> [2025-02-23T21:41:05.311] fatal: _build_shared_list: bad configuration, 
> multiple configurations without "File"

No matter what I tried, any NodeName line with multiple shard:$type:$num 
entries generates the _build_shared_list error above.
I tried re-ordering the list as gpu,shard,gpu,shard to no avail. I also tried 
an untyped shard entry followed by typed ones, but then it complained about too 
many shards (double the intended number: $untyped + $typed = too many).
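
Given that the fatal specifically calls out the missing "File", my next guess 
(untested) is to drop AutoDetect on that node and spell things out in gres.conf, 
binding each typed group to its devices. Something like the sketch below, where 
the /dev/nvidia paths and the Type= on the shard lines are my assumptions, not 
anything I have confirmed is supported:

    # hypothetical gres.conf for $HOST3 only; device order assumed
    NodeName=$HOST3 Name=gpu   Type=p40  File=/dev/nvidia[0-1]
    NodeName=$HOST3 Name=gpu   Type=p100 File=/dev/nvidia[2-3]
    NodeName=$HOST3 Name=shard Type=p40  Count=8 File=/dev/nvidia[0-1]
    NodeName=$HOST3 Name=shard Type=p100 Count=8 File=/dev/nvidia[2-3]

If someone can confirm whether shard lines even accept Type/File this way, that 
alone would help.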

So my current workaround is to keep slurm.conf in sync everywhere BUT the 
host(s) with multiple gpu models, where I had to revert to untyped shard gres 
in the copy used by those slurmd’s, while the copies on the slurmctlds keep the 
typed gres.
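
Concretely, the controllers’ copy has

    NodeName=$HOST3 [SNIP]   Gres=gpu:p40:2,gpu:p100:2,shard:p40:8,shard:p100:8

while the copy read by slurmd on $HOST3 has

    NodeName=$HOST3 [SNIP]   Gres=gpu:p40:2,gpu:p100:2,shard:16

which defeats the goal of an identical slurm.conf everywhere.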

Hopefully I’ve done a decent job of explaining the corner case, enough that 
someone can point me in the right direction on what’s going on and what the 
“correct” way of doing this is.
I tried increasing the slurmd and slurmctld logging to debug2, but nothing 
stood out beyond what was already gathered:
> [2025-02-23T21:46:44.895] debug:      GRES[shard] Type:p100 Count:8 
> Cores(88):(null)  Links:(null) Flags:HAS_TYPE File:(null) UniqueId:(null)
> [2025-02-23T21:46:44.895] debug:      GRES[shard] Type:p40 Count:8 
> Cores(88):(null)  Links:(null) Flags:HAS_TYPE File:(null) UniqueId:(null)
> [2025-02-23T21:46:44.895] fatal: _build_shared_list: bad configuration, 
> multiple configurations without "File"
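
(The debug2 bump was nothing special, for anyone reproducing; roughly

    SlurmctldDebug=debug2
    SlurmdDebug=debug2

in slurm.conf, followed by an scontrol reconfigure.)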


Any ideas are greatly appreciated,
Reed
