Hoping someone can help me pin down some weirdness I'm experiencing. There are actually two issues: the root issue, and then something odd I ran into while trying to work around it.
Environment: Slurm 23.11.10 on Ubuntu 22.04, slurm-smd debs built from the release tarball per <https://slurm.schedmd.com/archive/slurm-23.11.10/quickstart_admin.html#debuild>. I have two slurmctld daemons and one slurmdbd daemon. The slurm.conf is consistent across the cluster; the copies on the ctld, dbd, and slurmd hosts all have the same shasum.

The backup slurmctld does not appear to like the gres configured on my GPU nodes, even though the primary slurmctld has no issue with it. When failing over with scontrol takeover, I get the following log messages on the secondary slurmctld, where it complains about a typed gres/shard being reported but not configured:

> [2025-02-21T05:02:02.017] error: Setting node $HOST2 state to INVAL with
> reason:gres/shard type (p100) reported but not configured
> [2025-02-21T05:02:02.018] drain_nodes: node $HOST2 state set to DRAIN
> [2025-02-21T05:02:02.018] error: _slurm_rpc_node_registration node=$HOST2:
> Invalid argument
> [2025-02-21T05:02:02.020] error: Setting node $HOST3 state to INVAL with
> reason:gres/shard type (p40) reported but not configured
> [2025-02-21T05:02:02.020] drain_nodes: node $HOST3 state set to DRAIN
> [2025-02-21T05:02:02.020] error: _slurm_rpc_node_registration node=$HOST3:
> Invalid argument
> [2025-02-21T05:02:02.023] error: Setting node $HOST1 state to INVAL with
> reason:gres/shard type (t4) reported but not configured
> [2025-02-21T05:02:02.020] drain_nodes: node $HOST1 state set to DRAIN
> [2025-02-21T05:02:02.023] error: _slurm_rpc_node_registration node=$HOST1:
> Invalid argument

Looking at those hosts in slurm.conf, the shards are not typed, but generic:

> NodeName=$HOST1 [SNIP] State=UNKNOWN Gres=gpu:t4:2,shard:8
> NodeName=$HOST2 [SNIP] State=UNKNOWN Gres=gpu:p100:2,shard:8
> NodeName=$HOST3 [SNIP] State=UNKNOWN Gres=gpu:p40:2,gpu:p100:2,shard:16

I should also point out that my gres.conf is just AutoDetect=nvml; I am not explicitly mapping any devices.

So that is problem 1: the two slurmctld daemons behave differently with the same configuration. The primary has zero issue with it, while the secondary consistently balks and drains the nodes.

On to problem 2, which came up while I was trying to work around problem 1. I decided to add the typed gres/shards to both AccountingStorageTRES and the NodeName lines:

> NodeName=$HOST1 [SNIP] State=UNKNOWN Gres=gpu:t4:2,shard:t4:8
> NodeName=$HOST2 [SNIP] State=UNKNOWN Gres=gpu:p100:2,shard:p100:8
> NodeName=$HOST3 [SNIP] State=UNKNOWN
> Gres=gpu:p40:2,gpu:p100:2,shard:p40:8,shard:p100:8

Now the secondary slurmctld is happy and no longer immediately drains the GPU nodes for invalid gres. However, the node with the mixed set of GPUs ($HOST3) now hits a fatal error:

> [2025-02-23T21:41:05.311] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system
> device(s) detected
> [2025-02-23T21:41:05.311] fatal: _build_shared_list: bad configuration,
> multiple configurations without "File"

No matter what I tried, any NodeName line with more than one shard:$type:$num entry produces the _build_shared_list error above. I tried re-ordering the list as gpu,shard,gpu,shard to no avail. I also tried an untyped shard entry followed by typed ones, but then it complained about too many shards (double the intended number: $untyped + $typed = too many).

My current workaround is to keep slurm.conf in sync everywhere EXCEPT on the host(s) with multiple GPU models: on those hosts I had to revert the slurmd to untyped shard gres, while the slurmctlds keep the typed gres.
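One idea I have not actually tried yet, so treat it purely as a sketch: the _build_shared_list error seems to want a File= on every shard line, which would mean dropping AutoDetect=nvml on $HOST3 and writing its gres.conf out by hand, loosely modeled on the sharding example in the Slurm gres documentation. The device paths and ordering below are guesses for a 4-GPU box, not something I have verified:

  # Hypothetical gres.conf for $HOST3 only (would replace AutoDetect=nvml there)
  Name=gpu   Type=p40   File=/dev/nvidia0
  Name=gpu   Type=p40   File=/dev/nvidia1
  Name=gpu   Type=p100  File=/dev/nvidia2
  Name=gpu   Type=p100  File=/dev/nvidia3
  # 4 shards per device, 16 total, to match shard:p40:8,shard:p100:8 in slurm.conf
  Name=shard Count=4 File=/dev/nvidia0
  Name=shard Count=4 File=/dev/nvidia1
  Name=shard Count=4 File=/dev/nvidia2
  Name=shard Count=4 File=/dev/nvidia3

The thinking is simply that every shard line then carries a File=, which is exactly what the fatal message says is missing, but I have no idea if that is the intended way to express shards on a node with mixed GPU models.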
Hopefully I've explained the corner case well enough that someone can point me toward what is going on and what the "correct" way of doing this is. I tried increasing the slurmd and slurmctld logging to debug2, but nothing stood out beyond what I had already gathered:

> [2025-02-23T21:46:44.895] debug: GRES[shard] Type:p100 Count:8
> Cores(88):(null) Links:(null) Flags:HAS_TYPE File:(null) UniqueId:(null)
> [2025-02-23T21:46:44.895] debug: GRES[shard] Type:p40 Count:8
> Cores(88):(null) Links:(null) Flags:HAS_TYPE File:(null) UniqueId:(null)
> [2025-02-23T21:46:44.895] fatal: _build_shared_list: bad configuration,
> multiple configurations without "File"

Any ideas are greatly appreciated,

Reed
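P.S. For clarity, by "increasing logging to debug2" I mean the usual knobs, roughly as below (the same can be done at runtime with scontrol setdebug / setdebugflags); the DebugFlags line is only a sketch of a gres-specific option, not something I have confirmed surfaces more detail here:

  # slurm.conf (sketch)
  SlurmctldDebug=debug2
  SlurmdDebug=debug2
  # gres-specific debugging, e.g. "scontrol setdebugflags +Gres" at runtime
  DebugFlags=Gres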