I’ve got essentially 3 “tiers” of jobs.

tier1 are stateless and can be requeued
tier2 are stateful and can be suspended
tier3 are “high priority” and can preempt tier1 and tier2 with the requisite 
preemption modes.

> $ sacctmgr show qos format=name%10,priority%10,preempt%12,preemptmode%10
>       Name   Priority      Preempt PreemptMod
> ---------- ---------- ------------ ----------
>     normal          0                 cluster
>      tier1         10                 requeue
>      tier2         10                 suspend
>      tier3        100  tier1,tier2    cluster
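
Setting up that QOS layout from scratch is roughly equivalent to the following (paraphrased, not my exact command history):

> $ sacctmgr add qos tier1
> $ sacctmgr modify qos tier1 set priority=10 preemptmode=requeue
> $ sacctmgr add qos tier2
> $ sacctmgr modify qos tier2 set priority=10 preemptmode=suspend
> $ sacctmgr add qos tier3
> $ sacctmgr modify qos tier3 set priority=100 preempt=tier1,tier2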

I also have a separate partition on the same hardware nodes so that tier3 can 
cross partitions to suspend tier2 (if it's possible to have this all work in 
a single partition, please let me know).
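
The partition layout is conceptually something like this in slurm.conf (partition and node names are placeholders, not my real config):

> PartitionName=general Nodes=node[001-100] Default=YES   # tier1/tier2 jobs
> PartitionName=highpri Nodes=node[001-100]               # tier3 jobs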

tier1 and tier2 get preempted by tier3 perfectly. The problem is that tier3 now 
gets gang scheduled against itself whenever the tier3 queue gets deep, and I 
never want gang scheduling anywhere, but especially not in tier3.
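
For reference, jobs land in the tiers via the matching QOS and partition, along these lines (placeholder script names, same placeholder partition names as above):

> $ sbatch --partition=general --qos=tier1 batch_job.sh
> $ sbatch --partition=general --qos=tier2 stateful_job.sh
> $ sbatch --partition=highpri --qos=tier3 urgent_job.sh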

> PreemptType=preempt/qos
> PreemptMode=SUSPEND,GANG

This is what is in my slurm.conf, because if I try to set PreemptMode=SUSPEND, 
the ctld won’t start due to:
> slurmctld: error: PreemptMode=SUSPEND requires GANG too

I have also tried setting PreemptMode=OFF on the (tier3) partition, but as far 
as I can tell it has had no effect on gang scheduling.
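
Concretely, that attempt looked something like this (same placeholder names as above):

> PartitionName=highpri Nodes=node[001-100] PreemptMode=OFF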

Right now, my hit-it-with-a-hammer solution is increasing SchedulerTimeSlice to 
65535, which should effectively prevent jobs from being rotated by gang 
scheduling. While this gets me to the goal I'm looking for, it's inelegant, and 
if I end up with jobs that run past ~18 hours (65535 seconds is roughly 18.2 
hours), it's not going to work the way I want/hope/expect.
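
In other words, the hammer is this single line in slurm.conf:

> SchedulerTimeSlice=65535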

So I’m hoping there is a better solution that addresses the root issue: having 
the tier3 QOS/partition not preempt itself.

Hopefully I’ve described this well enough that someone can offer some pointers 
on how to have suspendable jobs in tier2 without incidental gang-suspension in 
tier3.

This is 21.08.8-2 in the production cluster, and I’m testing 22.05.2 in my test 
cluster, which is behaving the same way.

Reed
