IIRC, preemption is determined by partition first, not by node.
Since your pending job is in the 'day' partition, it will not preempt
something in the 'night' partition (even if the node is in both).
Brian Andrus
On 8/19/2021 2:49 PM, Russell Jones wrote:
Hi all,
I could use some help understanding why preemption is not working
properly for me. I have a job blocking other jobs in a way that doesn't
make sense to me. Any assistance is appreciated, thank you!
I have two partitions defined in Slurm, a daytime and a nighttime
partition:
Day partition - PriorityTier of 5, always Up. Limited resources
under this QOS.
Night partition - PriorityTier of 5 during the night; during the day
it is set to Down and its PriorityTier is changed to 1. Jobs can be
submitted to the night queue under an unlimited QOS as long as
resources are available.
The thought here is jobs can continue to run in the night
partition, even during the day time, until resources are requested
from the day partition. Jobs would then be requeued/canceled in
the night partition to satisfy those requirements.
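The day/night switch described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name `night_partition_argv` and the 07:00 to 19:00 day window are made up for illustration, and State=UP at night is inferred from the description rather than stated. It only builds the `scontrol update` command line; actually running it (for example from cron) is left out.

```python
from datetime import time

# Hypothetical day window; the post does not say when "day" starts or ends.
DAY_START = time(7, 0)
DAY_END = time(19, 0)


def night_partition_argv(now, partition="night"):
    """Build the scontrol command that puts the night partition into its
    day-time or night-time configuration for the given time of day."""
    is_day = DAY_START <= now < DAY_END
    if is_day:
        # Day time: partition is Down and its tier drops below 'day' (5).
        state, tier = "DOWN", 1
    else:
        # Night time: assumed to be Up, with the same tier as 'day'.
        state, tier = "UP", 5
    return [
        "scontrol", "update",
        f"PartitionName={partition}",
        f"State={state}",
        f"PriorityTier={tier}",
    ]


if __name__ == "__main__":
    print(" ".join(night_partition_argv(time(14, 30))))
    print(" ".join(night_partition_argv(time(23, 0))))
```

Returning the argument list instead of executing it keeps the sketch easy to inspect before it touches a live cluster.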
Current output of "scontrol show part" :
PartitionName=day
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=part_day
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO
MaxCPUsPerNode=UNLIMITED
Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO
OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=UP TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
PartitionName=night
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=part_night
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
GraceTime=0 Hidden=NO
MaxNodes=22 MaxTime=7-00:00:00 MinNodes=0 LLN=NO
MaxCPUsPerNode=UNLIMITED
Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=REQUEUE
State=DOWN TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
I currently have a job in the night partition that is blocking jobs in
the day partition, even though the day partition has a PriorityTier of
5, and night partition is Down with a PriorityTier of 1.
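With preempt/partition_prio, whether one partition's jobs may preempt another's depends on the partitions' PriorityTier values. A minimal sketch of checking that from the "scontrol show part" text is below. The helper names `parse_records` and `can_preempt` are hypothetical, the sample is abbreviated from the output above, and the check covers only the tier comparison and PreemptMode, not everything the scheduler considers.

```python
# Abbreviated from the "scontrol show part" output in this post.
SAMPLE = """
PartitionName=day
   PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=UP TotalCPUs=336 TotalNodes=21
PartitionName=night
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
   OverTimeLimit=NONE PreemptMode=REQUEUE
   State=DOWN TotalCPUs=336 TotalNodes=21
"""


def parse_records(text, key):
    """Split scontrol-style Key=Value output into one dict per record.
    A new record starts at every token whose key equals `key`."""
    records = []
    for token in text.split():
        name, sep, value = token.partition("=")
        if not sep:
            continue  # not a Key=Value token
        if name == key:
            records.append({})
        if records:
            records[-1][name] = value
    return records


def can_preempt(preemptor, victim):
    """Tier check only: the preemptor's partition needs a strictly higher
    PriorityTier, and the victim's partition must not have PreemptMode=OFF."""
    higher = int(preemptor["PriorityTier"]) > int(victim["PriorityTier"])
    return higher and victim.get("PreemptMode", "OFF") != "OFF"


if __name__ == "__main__":
    parts = {p["PartitionName"]: p for p in parse_records(SAMPLE, "PartitionName")}
    print(can_preempt(parts["day"], parts["night"]))  # prints: True
    print(can_preempt(parts["night"], parts["day"]))  # prints: False
```

By this check alone, the tiers shown above do allow 'day' jobs to preempt 'night' jobs.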
My current slurm.conf preemption settings are:
PreemptMode=REQUEUE
PreemptType=preempt/partition_prio
The blocking job's scontrol show job output is:
JobId=105713 JobName=jobname
Priority=1986 Nice=0 Account=xxx QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=17:49:39 TimeLimit=7-00:00:00 TimeMin=N/A
SubmitTime=2021-08-18T22:36:36 EligibleTime=2021-08-18T22:36:36
AccrueTime=2021-08-18T22:36:36
StartTime=2021-08-18T22:36:39 EndTime=2021-08-25T22:36:39
Deadline=N/A
PreemptEligibleTime=2021-08-18T22:36:39 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-18T22:36:39
Partition=night AllocNode:Sid=cluster-1:1341505
ReqNodeList=(null) ExcNodeList=(null)
NodeList=cluster-r1n[12-13],cluster-r2n[04-06]
BatchHost=cluster-r1n12
NumNodes=5 NumCPUs=80 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=80,node=5,billing=80,gres/gpu=20
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
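To see how much of the partition this job occupies, the bracketed node lists can be expanded and compared. On a cluster, "scontrol show hostnames" does the expansion; the sketch below is a stand-alone approximation with a hypothetical helper `expand_hostlist` that handles only one bracket group per entry.

```python
import re


def expand_hostlist(hostlist):
    """Expand a simple Slurm hostlist such as 'a[01-03],b[7,9]' into names.
    Supports one bracket group per entry; zero padding is preserved."""
    hosts = []
    # Each match is one comma-separated entry, optionally with a [...] group.
    for entry in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", hostlist):
        m = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", entry)
        if not m:
            hosts.append(entry)  # plain host name, nothing to expand
            continue
        prefix, body = m.groups()
        for part in body.split(","):
            lo, _, hi = part.partition("-")
            hi = hi or lo  # a single index is a range of one
            width = len(lo)
            for n in range(int(lo), int(hi) + 1):
                hosts.append(f"{prefix}{n:0{width}d}")
    return hosts


if __name__ == "__main__":
    partition_nodes = set(expand_hostlist("cluster-r1n[01-13],cluster-r2n[01-08]"))
    job_nodes = set(expand_hostlist("cluster-r1n[12-13],cluster-r2n[04-06]"))
    print(len(partition_nodes), len(job_nodes), len(partition_nodes - job_nodes))
    # prints: 21 5 16
```

The counts agree with TotalNodes=21 for the partition and NumNodes=5 for this job, which leaves 16 nodes in the partition that this job does not hold.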
The job that is being blocked:
JobId=105876 JobName=bash
Priority=2103 Nice=0 Account=xxx QOS=normal
JobState=PENDING
Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions
Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
SubmitTime=2021-08-19T16:19:23 EligibleTime=2021-08-19T16:19:23
AccrueTime=2021-08-19T16:19:23
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-19T16:26:43
Partition=day AllocNode:Sid=cluster-1:2776451
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=3 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=40,node=1,billing=40
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Why is the day job not preempting the night job?