Hi, I am trying to enable gang scheduling on a server with a CPU with 32 cores and 4 GPUs.
However, with gang scheduling enabled, CPU jobs (and GPU jobs) are not preempted after the time slice, which is set to 30 seconds. Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores. The first 2 jobs launched are never preempted. The 3rd job starves indefinitely (or at least until one of the other 2 ends):

JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  313  asimov01 cpu-only  hdaniel PD  0:00     1 (Resources)
  311  asimov01 cpu-only  hdaniel  R  1:52     1 asimov
  312  asimov01 cpu-only  hdaniel  R  1:49     1 asimov

The same happens with GPU jobs. If I launch 5 jobs requiring one GPU each, the 5th job never runs. Preemption is not happening at the specified time slice.

I tried several combinations:

SchedulerType=sched/builtin and sched/backfill
SelectType=select/cons_tres and select/linear

I would appreciate any help or suggestions. The slurm.conf is below.

Thanks

ClusterName=asimov
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc # proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none # task/cgroup
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
#FastSchedule=1 #obsolete
SchedulerType=sched/builtin #backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core #CR_Core_Memory let's only one job run at a time
PreemptType = preempt/partition_prio
PreemptMode = SUSPEND,GANG
SchedulerTimeSlice=30 #in seconds, default 30
#
# LOGGING AND ACCOUNTING
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageEnforce=associations
#ClusterName=bip-cluster
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
#NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
#PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
# Partitions
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
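
Note: the Slurm gang scheduling guide describes time-slicing within a partition as depending on the partition's OverSubscribe setting, and the PartitionName line above does not set it. The fragment below is only a sketch of what that variant might look like. It has not been tested on this machine, and the ":2" job count is an assumption chosen for illustration, not a recommended value.

```
# Hypothetical variant of the partition line above (untested).
# OverSubscribe=FORCE:2 is an assumed value; everything else is unchanged.
PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 OverSubscribe=FORCE:2 State=UP
```

Whether this changes the behaviour for the GPU jobs as well as the CPU jobs is not something the configuration above can confirm.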