After trying to approach this with preempt/partition_prio, we ended up moving to QOS-based preemption due to some issues with suspend/requeue, and also because QOS allows quicker/easier tweaks than changing partitions as a whole.
> PreemptType=preempt/qos
> PreemptMode=SUSPEND,GANG
>
> PartitionName=part-lopri Nodes=nodes[000-NNN] Default=NO MaxTime=INFINITE OverSubscribe=FORCE:1 PriorityTier=10 State=UP
> PartitionName=part-hipri Nodes=nodes[000-NNN] Default=NO MaxTime=INFINITE OverSubscribe=NO PriorityTier=100 State=UP PreemptMode=OFF

We then have a few QOS with different Priority values, as well as PreemptMode, which QOS they can preempt, etc.

>       Name   Priority    Preempt PreemptMode
> ---------- ---------- ---------- -----------
>         rq         10                requeue
>       susp         11                suspend
>      hipri        100    rq,susp     cluster
>       test         50         rq     requeue

The rq QOS is stateless and can be requeued; the susp QOS is stateful and needs to be suspended. hipri can preempt both rq and susp. We also have a test QOS with very strict limits (wall clock, job count, TRES count, etc.) that allows small jobs to jump the queue, for quick testing before submitting into the full queue.

The tricky part for us was that we have some stateful jobs that need to be suspended and some stateless jobs that can just be requeued without issue, but we wanted the hipri partition to take precedence on the same hardware pool. We also didn't want gang scheduling to flip-flop running jobs, which, if memory serves me correctly, is how/why we ended up with duplicate partitions for the purpose of priority: we couldn't get preemption to work correctly intra-partition. In a perfect world we would have just a single partition with everything handled in QOS, but it's working, and that's what mattered.

I'm not sure how any of this would work with FORCE:20 oversubscribe, but hopefully it offers something useful to try next.
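In case a concrete starting point helps, the QOS table above maps onto sacctmgr commands roughly like this (a sketch from memory, not our exact commands; the limits on the test QOS are made-up placeholders, ours are site-specific):

> sacctmgr add qos rq set Priority=10 PreemptMode=requeue
> sacctmgr add qos susp set Priority=11 PreemptMode=suspend
> sacctmgr add qos hipri set Priority=100 Preempt=rq,susp PreemptMode=cluster
> sacctmgr add qos test set Priority=50 Preempt=rq PreemptMode=requeue MaxWall=00:30:00 MaxJobsPerUser=2 MaxTRESPerUser=cpu=8

Users then request the matching QOS at submit time, e.g. sbatch --qos=susp --partition=part-lopri for a stateful job.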
Reed

> On May 24, 2023, at 8:42 AM, Groner, Rob <rug...@psu.edu> wrote:
>
> What you are describing is definitely doable. We have our system set up
> similarly. All nodes are in the "open" partition and "prio" partition, but a
> job submitted to the "prio" partition will preempt the open jobs.
>
> I don't see anything clearly wrong with your slurm.conf settings. Ours are
> very similar, though we use only FORCE:1 for oversubscribe. You might try
> that just to see if there's a difference.
>
> What are the sbatch settings you are using when you submit the jobs?
>
> Do you have PreemptExemptTime set to anything in slurm.conf?
>
> What is the reason squeue gives for the high-priority jobs to be pending?
>
> For your "run regularly" goal, you might consider scrontab. If we can figure
> out priority and preemption, then that will start the job at a regular time.
>
> Rob
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Fabrizio Roccato <f.rocc...@isac.cnr.it>
> Sent: Wednesday, May 24, 2023 7:17 AM
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] hi-priority partition and preemption
>
> Hi all,
> I'm trying to have two overlapping partitions, say normal and hi-pri,
> so that when jobs are launched in the second one they can preempt the jobs
> already running in the first one, automatically putting them in suspend
> state. After completion, the jobs in the normal partition must be
> automatically resumed.
>
> Here are my (relevant) slurm.conf settings:
>
> > PreemptMode=suspend,gang
> > PreemptType=preempt/partition_prio
> >
> > PartitionName=normal Nodes=node0[01-08] MaxTime=1800 PriorityTier=100 AllowAccounts=group1,group2 OverSubscribe=FORCE:20 PreemptMode=suspend
> > PartitionName=hi-pri Nodes=node0[01-08] MaxTime=360 PriorityTier=500 AllowAccounts=group2 OverSubscribe=FORCE:20 PreemptMode=off
>
> But with this, jobs in the hi-pri partition are put in PD state and the ones
> already running in the normal partition continue in their R status.
> What am I doing wrong? What am I missing?
>
> Since I have jobs that must run at specific times and must have priority over
> all others, is this the correct way to do it?
>
> Thanks
>
> FR
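P.S. On Rob's scrontab suggestion for the jobs that must run at specific times: scrontab has to be turned on with ScronParameters=enable in slurm.conf before users can register entries. A minimal, untested sketch (the script path and schedule are made up), edited via scrontab -e:

> #SCRON --partition=hi-pri --time=30
> 0 2 * * * /path/to/nightly_job.sh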