I have been researching this further and I see other systems that appear to be set up the same way ours is. Example: https://hpcrcf.atlassian.net/wiki/spaces/TCP/pages/733184001/How-to+Use+the+preempt+Partition
Any further insight into what may be wrong with our setup is appreciated. I
am not seeing what is wrong with my config, but it also isn't working
anymore to allow preemption.

On Fri, Aug 20, 2021 at 9:46 AM Russell Jones <arjone...@gmail.com> wrote:

> I could have sworn I had tested this before implementing it and it worked
> as expected.
>
> If I am dreaming that testing - is there a way of allowing preemption
> across partitions?
>
> On Fri, Aug 20, 2021 at 8:40 AM Brian Andrus <toomuc...@gmail.com> wrote:
>
>> IIRC, preemption is determined by partition first, not node.
>>
>> Since your pending job is in the 'day' partition, it will not preempt
>> something in the 'night' partition (even if the node is in both).
>>
>> Brian Andrus
>>
>> On 8/19/2021 2:49 PM, Russell Jones wrote:
>>
>> Hi all,
>>
>> I could use some help understanding why preemption is not working
>> properly for me. I have a job blocking other jobs in a way that doesn't
>> make sense to me. Any assistance is appreciated, thank you!
>>
>> I have two partitions defined in Slurm, a day-time and a night-time
>> partition:
>>
>> Day partition - PriorityTier of 5, always Up. Limited resources under
>> this QOS.
>> Night partition - PriorityTier of 5 during night time; during the day it
>> is set to Down and its PriorityTier changed to 1. Jobs can be submitted
>> to the night queue under an unlimited QOS as long as resources are
>> available.
>>
>> The idea is that jobs can continue to run in the night partition, even
>> during the day, until resources are requested from the day partition.
>> Night-partition jobs would then be requeued/canceled to satisfy those
>> requests.
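[Editor's note: the day/night flip described above is typically automated with cron calling `scontrol update`; a minimal sketch for context - the cron file path and the 06:00/18:00 switch times are assumptions for illustration, not taken from the poster's actual setup:]

```
# /etc/cron.d/slurm-day-night (sketch; path and times are assumed)
# 18:00 - open the night partition at full priority
0 18 * * * root scontrol update PartitionName=night State=UP PriorityTier=5
# 06:00 - close the night partition and drop its priority so day jobs win
0 6 * * * root scontrol update PartitionName=night State=DOWN PriorityTier=1
```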
>>
>> Current output of "scontrol show part":
>>
>> PartitionName=day
>>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>>    AllocNodes=ALL Default=NO QoS=part_day
>>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>>    MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>>    Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
>>    PriorityJobFactor=1 PriorityTier=5 RootOnly=NO ReqResv=NO OverSubscribe=NO
>>    OverTimeLimit=NONE PreemptMode=REQUEUE
>>    State=UP TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
>>    JobDefaults=(null)
>>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>>
>> PartitionName=night
>>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>>    AllocNodes=ALL Default=NO QoS=part_night
>>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>>    MaxNodes=22 MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>>    Nodes=cluster-r1n[01-13],cluster-r2n[01-08]
>>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>>    OverTimeLimit=NONE PreemptMode=REQUEUE
>>    State=DOWN TotalCPUs=336 TotalNodes=21 SelectTypeParameters=NONE
>>    JobDefaults=(null)
>>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>>
>> I currently have a job in the night partition that is blocking jobs in
>> the day partition, even though the day partition has a PriorityTier of 5
>> and the night partition is Down with a PriorityTier of 1.
>>
>> My current slurm.conf preemption settings are:
>>
>> PreemptMode=REQUEUE
>> PreemptType=preempt/partition_prio
>>
>> The blocking job's "scontrol show job" output is:
>>
>> JobId=105713 JobName=jobname
>>    Priority=1986 Nice=0 Account=xxx QOS=normal
>>    JobState=RUNNING Reason=None Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>    RunTime=17:49:39 TimeLimit=7-00:00:00 TimeMin=N/A
>>    SubmitTime=2021-08-18T22:36:36 EligibleTime=2021-08-18T22:36:36
>>    AccrueTime=2021-08-18T22:36:36
>>    StartTime=2021-08-18T22:36:39 EndTime=2021-08-25T22:36:39 Deadline=N/A
>>    PreemptEligibleTime=2021-08-18T22:36:39 PreemptTime=None
>>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-18T22:36:39
>>    Partition=night AllocNode:Sid=cluster-1:1341505
>>    ReqNodeList=(null) ExcNodeList=(null)
>>    NodeList=cluster-r1n[12-13],cluster-r2n[04-06]
>>    BatchHost=cluster-r1n12
>>    NumNodes=5 NumCPUs=80 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>    TRES=cpu=80,node=5,billing=80,gres/gpu=20
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>>    Features=(null) DelayBoot=00:00:00
>>    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>>
>> The job that is being blocked:
>>
>> JobId=105876 JobName=bash
>>    Priority=2103 Nice=0 Account=xxx QOS=normal
>>    JobState=PENDING
>>    Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions
>>    Dependency=(null)
>>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>>    RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
>>    SubmitTime=2021-08-19T16:19:23 EligibleTime=2021-08-19T16:19:23
>>    AccrueTime=2021-08-19T16:19:23
>>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-19T16:26:43
>>    Partition=day AllocNode:Sid=cluster-1:2776451
>>    ReqNodeList=(null) ExcNodeList=(null)
>>    NodeList=(null)
>>    NumNodes=3 NumCPUs=40 NumTasks=40 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>    TRES=cpu=40,node=1,billing=40
>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>>    Features=(null) DelayBoot=00:00:00
>>    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>>
>> Why is the day job not preempting the night job?
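[Editor's note: for readers skimming the thread, the rule Brian describes can be reduced to a toy model - this is our simplification for illustration, not Slurm's actual scheduler code. With PreemptType=preempt/partition_prio, a pending job is expected to preempt running jobs only from partitions with a strictly lower PriorityTier than its own partition:]

```python
# Toy model (ours, not Slurm source) of the preempt/partition_prio rule:
# a pending job may preempt running jobs whose partition has a strictly
# lower PriorityTier than the pending job's own partition.

def can_preempt(pending_tier: int, running_tier: int) -> bool:
    """True if a job in a partition with PriorityTier `pending_tier` may
    preempt a job in a partition with PriorityTier `running_tier`."""
    return pending_tier > running_tier

# Values from the scontrol output in the thread:
day_tier = 5    # day partition, PriorityTier=5
night_tier = 1  # night partition (daytime state), PriorityTier=1

print(can_preempt(day_tier, night_tier))  # -> True
```

By that rule the day job (tier 5) should be able to preempt the night job (tier 1), which is exactly why the observed blocking is puzzling and why the poster suspects something else in the setup is interfering.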