Following up with a bit more specific color on what I’m seeing, as well as a 
solution that, I’m a bit ashamed to say, I didn’t land on sooner.

If there is exclusively tier3 work queued up, gang scheduling never comes into 
play.
If there is tier3 + tier1 work queued up, tier1 gets requeued and tier3 preempts 
as expected.
If enough work is queued in tier3 that it also triggers a suspend preemption in 
tier2, that’s when things fall over and gang scheduling starts happening inside 
the tier3 queue.

So the issue seems to have stemmed from my use of OverSubscribe=FORCE:1 in my 
tier3 partition (separate from the tier1/2 partition).
I had set that in anticipation of increasing the forced oversubscription limit 
in the future, while wanting to keep oversubscription effectively “off” for now.
However, setting OverSubscribe=NO on the tier3 partition, and leaving 
OverSubscribe=FORCE:1 on the tier1/2 partition, makes the gang scheduling inside 
tier3 go away.
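
For reference, the relevant partition lines now look something like this (the 
node list and partition names are placeholders and other options are omitted, 
so treat it as a sketch of the shape rather than my exact slurm.conf):

# tier1/tier2 partition: forced oversubscription stays on (limit may be raised later)
PartitionName=tier12 Nodes=node[01-16] OverSubscribe=FORCE:1
# tier3 partition: same nodes, but OverSubscribe=NO keeps gang scheduling out of it
PartitionName=tier3  Nodes=node[01-16] OverSubscribe=NO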

This gets me to where I wanted to be in the first place: no gang scheduling in 
tier3, while tier1/tier2 jobs can still be requeued/suspended.
So I’ve answered my own question, and hopefully someone else will benefit from 
this.

Reed

> On Aug 8, 2022, at 11:27 AM, Reed Dier <reed.d...@focusvq.com> wrote:
> 
> I’ve got essentially 3 “tiers” of jobs.
> 
> tier1 jobs are stateless and can be requeued
> tier2 jobs are stateful and can be suspended
> tier3 jobs are “high priority” and can preempt tier1 and tier2 with the requisite 
> preemption modes.
> 
>> $ sacctmgr show qos format=name%10,priority%10,preempt%12,preemptmode%10
>>       Name   Priority      Preempt PreemptMod
>> ---------- ---------- ------------ ----------
>>     normal          0                 cluster
>>      tier1         10                 requeue
>>      tier2         10                 suspend
>>      tier3        100  tier1,tier2    cluster
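> 
> (For completeness, these QOS were set up with sacctmgr along these lines; the 
> exact invocations below are a sketch from memory rather than a copy/paste. 
> tier3 is left at the default PreemptMode, which displays as "cluster".)
> 
>> sacctmgr add qos tier1 priority=10 preemptmode=requeue
>> sacctmgr add qos tier2 priority=10 preemptmode=suspend
>> sacctmgr add qos tier3 priority=100 preempt=tier1,tier2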
> 
> I also have a separate partition for the same hardware nodes to allow tier3 to 
> cross partitions and suspend tier2 (if it’s possible to have this all work in a 
> single partition, please let me know).
> 
> tier1 and tier2 get preempted by tier3 perfectly, but the problem now is that 
> tier3 jobs get gang scheduled among themselves whenever the tier3 queue gets 
> deep, and I never want gang scheduling anywhere, least of all in tier3.
> 
>> PreemptType=preempt/qos
>> PreemptMode=SUSPEND,GANG
> 
> This is what is in my slurm.conf, because if I try to set 
> PreemptMode=SUSPEND, the ctld won’t start due to:
>> slurmctld: error: PreemptMode=SUSPEND requires GANG too
> 
> I have also tried setting PreemptMode=OFF on the (tier3) partition, but that 
> has had no effect on gang scheduling that I can see.
> 
> Right now, my hit-it-with-a-hammer solution is increasing SchedulerTimeSlice 
> to 65535, which should effectively prevent jobs from gang scheduling.
> While that gets me close to the goal I’m looking for, it's inelegant, and if I 
> end up with jobs that run past ~18 hours it is not going to work as I 
> want/hope/expect.
> 
> So I’m hoping there is a better solution that addresses the root issue: having 
> the tier3 qos/partition not preempt itself.
> 
> Hopefully I’ve described this well enough and someone can offer some pointers 
> on how to have suspend-able jobs in tier2, without having incidental 
> gang-suspension in tier3.
> 
> This is 21.08.8-2 in the production cluster, and 22.05.2 in my test cluster is 
> behaving the same way.
> 
> Reed
