After trying to approach this with preempt/partition_prio, we ended up moving to QOS-based preemption due to some issues with suspend/requeue, and also because QOS allows quicker/easier tweaks than changing partitions as a whole.
> PreemptType=preempt/qos
> PreemptMode=SUSPEND,GANG
>
> PartitionName=part-lopri Nodes=nodes[000-NNN] Default=NO MaxTime=INFINITE OverSubscribe=FORCE:1 PriorityTier=10 State=UP
> PartitionName=part-hipri Nodes=nodes[000-NNN] Default=NO MaxTime=INFINITE OverSubscribe=NO PriorityTier=100 State=UP PreemptMode=OFF

We then have a few QOS with different Priority values, as well as PreemptMode, which QOS they can preempt, etc.

>       Name   Priority    Preempt PreemptMode
> ---------- ---------- ---------- -----------
>         rq         10                requeue
>       susp         11                suspend
>      hipri        100    rq,susp     cluster
>       test         50         rq     requeue

The rq QOS is stateless and can be requeued; the susp QOS is stateful and needs to be suspended. hipri can preempt both rq and susp. We also have a test QOS with very strict limits (wall clock, job count, TRES count, etc.) that allows small jobs to jump the queue, for quick testing before submitting into the full queue.

The tricky part for us was that we have some stateful jobs that need to be suspended and some stateless jobs that can just be requeued without issue, but we wanted the hipri partition to take precedence on the same hardware pool. We also didn't want gang scheduling to flip-flop running jobs, which, if memory serves me correctly, is how/why we ended up with duplicate partitions for the purpose of priority: we couldn't get preemption to work correctly intra-partition. In a perfect world we would have just a single partition with everything handled in QOS, but it's working, and that's what mattered.

I'm not sure how any of this would work with FORCE:20 oversubscribe, but hopefully it offers something useful to try next.
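In case a concrete starting point helps, the QOS table above maps onto sacctmgr commands roughly like this (a sketch from memory, not our exact commands; the limits on the test QOS are made-up placeholders, ours are site-specific):

> sacctmgr add qos rq set Priority=10 PreemptMode=requeue
> sacctmgr add qos susp set Priority=11 PreemptMode=suspend
> sacctmgr add qos hipri set Priority=100 Preempt=rq,susp PreemptMode=cluster
> sacctmgr add qos test set Priority=50 Preempt=rq PreemptMode=requeue MaxWall=00:30:00 MaxJobsPerUser=2 MaxTRESPerUser=cpu=8

Users then request the matching QOS at submit time, e.g. sbatch --qos=susp --partition=part-lopri for a stateful job.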
Reed

> On May 24, 2023, at 8:42 AM, Groner, Rob <rug...@psu.edu> wrote:
>
> What you are describing is definitely doable. We have our system set up
> similarly. All nodes are in the "open" partition and "prio" partition, but a
> job submitted to the "prio" partition will preempt the open jobs.
>
> I don't see anything clearly wrong with your slurm.conf settings. Ours are
> very similar, though we use only FORCE:1 for oversubscribe. You might try
> that just to see if there's a difference.
>
> What are the sbatch settings you are using when you submit the jobs?
>
> Do you have PreemptExemptTime set to anything in slurm.conf?
>
> What is the reason squeue gives for the high-priority jobs to be pending?
>
> For your "run regularly" goal, you might consider scrontab. If we can figure
> out priority and preemption, then that will start the job at a regular time.
>
> Rob
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Fabrizio Roccato <f.rocc...@isac.cnr.it>
> Sent: Wednesday, May 24, 2023 7:17 AM
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] hi-priority partition and preemption
>
> Hi all,
> I'm trying to have two overlapping partitions, say normal and hi-pri,
> so that when jobs are launched in the second one they can preempt the jobs
> already running in the first one, automatically putting them in suspend
> state. After completion, the jobs in the normal partition must be
> automatically resumed.
>
> Here are my (relevant) slurm.conf settings:
>
> > PreemptMode=suspend,gang
> > PreemptType=preempt/partition_prio
> >
> > PartitionName=normal Nodes=node0[01-08] MaxTime=1800 PriorityTier=100 AllowAccounts=group1,group2 OverSubscribe=FORCE:20 PreemptMode=suspend
> > PartitionName=hi-pri Nodes=node0[01-08] MaxTime=360 PriorityTier=500 AllowAccounts=group2 OverSubscribe=FORCE:20 PreemptMode=off
>
> But with this, jobs in the hi-pri partition are put in PD state and the ones
> already running in the normal partition continue in their R status.
> What am I doing wrong? What am I missing?
>
> Since I have jobs that must run at specific times and must have priority over
> all others, is this the correct way to do it?
>
> Thanks
>
> FR
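P.S. On Rob's scrontab suggestion for the jobs that must run at specific times: scrontab has to be turned on with ScronParameters=enable in slurm.conf before users can register entries. A minimal, untested sketch (the script path and schedule are made up), edited via scrontab -e:

> #SCRON --partition=hi-pri --time=30
> 0 2 * * * /path/to/nightly_job.sh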