I have 2 nodes that offer a "gc" feature. Node t-gc-1202 is "normal", and node
t-gc-1201 is dynamic. I can successfully remove t-gc-1201 and bring it back
dynamically. Once it is back, the node appears JUST LIKE the "normal" node in
the sinfo output, as seen here:
[rug262@testsch (RC) slurm] sinfo -o "%20N %10c %10m %25f %10G "
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
t-sc-[1101-1104]     48         358400     nogpu,sc                  (null)
t-gc-1201            48         385420     gpu,gc,a100               gpu:2(S:0-
t-gc-1202            48         358400     gpu,gc,a100               gpu:2
t-ic-1051            36         500000     ic,a40                    (null)
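For reference, the remove/re-add cycle was done roughly like this (a sketch; the exact --conf string shown here is illustrative, not copied from my setup):

```
# On the controller, drop the dynamic node (sketch):
scontrol delete nodename=t-gc-1201

# On t-gc-1201, re-register it dynamically, passing its config on the
# command line (values below are illustrative):
slurmd -Z --conf "CPUs=48 RealMemory=385420 Gres=gpu:2 Feature=gpu,gc,a100"
```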
When I submit a job requiring 24 CPUs and the gc feature, it runs only on
t-gc-1202. If I sbatch 3 of the same job at once, 2 run on t-gc-1202 and the
3rd stays pending on resources.
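The job in question looks roughly like this (a sketch; the script body and the partition/time settings are placeholders, not my actual script):

```
#!/bin/bash
# Sketch of the test job; partition and payload are placeholders.
#SBATCH --job-name=gpu_test
#SBATCH --partition=open-requeue
#SBATCH -n 24
#SBATCH --constraint=gc
srun hostname
```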
[rug262@testsch (RC) slurm] squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    405 open-requ gpu_test   rug262 PD  0:00     1 (Resources)
    404 open-requ gpu_test   rug262  R  0:06     1 t-gc-1202
    403 open-requ gpu_test   rug262  R  0:07     1 t-gc-1202
Both nodes appear in their partitions and show as idle before the jobs start:
[rug262@testsch (RC) slurm] sinfo
PARTITION     AVAIL TIMELIMIT  NODES STATE NODELIST
open*         up    2-00:00:00     4 idle  t-sc-[1101-1104]
open-requeue  up    2-00:00:00     6 idle  t-gc-[1201-1202],t-sc-[1101-1104]
intr          up    2-00:00:00     1 idle  t-ic-1051
sla-prio      up    infinite       6 idle  t-gc-[1201-1202],t-sc-[1101-1104]
burst         up    infinite       4 idle  t-sc-[1101-1104]
burst-requeue up    infinite       6 idle  t-gc-[1201-1202],t-sc-[1101-1104]
debug         up    infinite       7 idle  t-gc-[1201-1202],t-ic-1051,t-sc-[1101-1104]
So my 2 questions:
1. How do I get my dynamic node utilized like the non-dynamic nodes?
2. I want the dynamic node to carry a DIFFERENT feature that is not present
on the "normal" nodes. When a job is submitted requiring that feature, the
job should stay pending until the dynamic node becomes available. How do I
go about setting that up?
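To make question 2 concrete, what I have in mind is something like the following (a sketch only; the feature name "dyn" and the config lines are hypothetical, not from my current slurm.conf):

```
# slurm.conf (sketch): only the dynamic node carries the extra "dyn" feature.
NodeName=t-gc-1202 CPUs=48 RealMemory=358400 Gres=gpu:2 Feature=gpu,gc,a100
NodeName=t-gc-1201 CPUs=48 RealMemory=385420 Gres=gpu:2 Feature=gpu,gc,a100,dyn State=FUTURE

# Jobs needing the dynamic node would then be submitted with:
#   sbatch --constraint=dyn job.sh
# and should sit pending (Resources) until t-gc-1201 registers.
```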
Thanks.