Hi all, I could use some help understanding why preemption is not working properly for me. A job is blocking other jobs in a way that doesn't make sense to me. Any assistance is appreciated, thank you!
I have two partitions defined in Slurm, a daytime partition and a nighttime partition:

Day partition - PriorityTier of 5, always Up. Limited resources under this QOS.

Night partition - PriorityTier of 5 during the night; during the day it is set to Down and its PriorityTier is changed to 1. Jobs can be submitted to the night queue under an unlimited QOS as long as resources are available.

The idea is that jobs can continue to run in the night partition, even during the day, until resources are requested from the day partition. Jobs in the night partition would then be requeued/canceled to satisfy those requests.

Current output of "scontrol show part":

PartitionName=day
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=part_day
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
   PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

PartitionName=night
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=part_night
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=22 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=DOWN TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

I currently have a job in the night partition that is blocking jobs in the day partition, even though the day partition has a PriorityTier of 5 and the night partition is Down with a PriorityTier of 1.
My current slurm.conf preemption settings are:

PreemptMode=REQUEUE
PreemptType=preempt/partition_prio

The blocking job's "scontrol show job" output is:

JobId=105713 JobName=jobname
   Priority=1986 Nice=0 Account=xxx QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=17:49:39 TimeLimit=7-00:00:00 TimeMin=N/A
   SubmitTime=2021-08-18T22:36:36 EligibleTime=2021-08-18T22:36:36
   AccrueTime=2021-08-18T22:36:36
   StartTime=2021-08-18T22:36:39 EndTime=2021-08-25T22:36:39 Deadline=N/A
   PreemptEligibleTime=2021-08-18T22:36:39 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-18T22:36:39
   Partition=night AllocNode:Sid=cluster-1:1341505
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cluster-r1n[12-13],cluster-r2n[04-06]
   BatchHost=cluster-r1n12
   NumNodes=5 NumCPUs=80 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=80,node=5,billing=80,gres/gpu=20
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)

The job that is being blocked:

JobId=105876 JobName=bash
   Priority=2103 Nice=0 Account=xxx QOS=normal
   JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2021-08-19T16:19:23 EligibleTime=2021-08-19T16:19:23
   AccrueTime=2021-08-19T16:19:23
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-19T16:26:43
   Partition=day AllocNode:Sid=cluster-1:2776451
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=3 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=40,node=1,billing=40
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)

Why is the day job not preempting the night job?
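
For reference, the day/night switch on the night partition described earlier can be sketched with "scontrol update" as follows. This is a minimal sketch, not the exact script in use: the function names and the SCONTROL override are hypothetical, State=UP at night is an assumption, and only the partition name, the Down state during the day, and the PriorityTier values come from the setup in this post.

```shell
#!/bin/sh
# Sketch of the day/night switch for the "night" partition.
# Assumption: the switch is made with `scontrol update`. The function names
# and the SCONTROL override are hypothetical and only for illustration.

# Allows a dry run (e.g. SCONTROL=echo) without touching the cluster.
SCONTROL="${SCONTROL:-scontrol}"

# Evening: open the night partition at the same tier as the day partition.
# State=UP is assumed here; the post only states the tier for night time.
night_partition_open() {
    "$SCONTROL" update PartitionName=night State=UP PriorityTier=5
}

# Morning: mark the night partition Down and drop its tier below day's,
# which is the state shown in the "scontrol show part" output in this post.
night_partition_close() {
    "$SCONTROL" update PartitionName=night State=DOWN PriorityTier=1
}
```

With SCONTROL=echo set, each function prints the arguments it would pass to scontrol instead of changing the partition, which makes it easy to check the values before running it for real.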