Following up with a bit more specific color on what I’m seeing, as well as a solution that I’m ashamed I didn’t come back with sooner.

If there is exclusively tier3 work queued up, gang scheduling never comes into play. If there is tier3+tier1 work queued up, tier1 gets requeued and tier3 preempts as expected. If enough work is queued in tier3 that it then triggers a suspend preemption in tier2, that’s when things fall over and gang scheduling starts happening inside of the tier3 queue.

The issue seems to have stemmed from my use of OverSubscribe=FORCE:1 in my tier3 partition (separate from the tier1/2 partition). This was set in anticipation of increasing the forced oversubscription limit in the future, while keeping oversubscription “off” for now. By setting OverSubscribe=NO on the tier3 partition, and leaving OverSubscribe=FORCE:1 on the tier1/2 partition, the gang scheduling inside the tier3 queue goes away.
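For concreteness, the relevant partition definitions in slurm.conf now look roughly like this (partition and node names below are placeholders rather than my actual config; the only meaningful change was flipping OverSubscribe on the tier3 partition from FORCE:1 to NO):

   PartitionName=tier12 Nodes=node[001-016] OverSubscribe=FORCE:1
   PartitionName=tier3  Nodes=node[001-016] OverSubscribe=NO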
This gets me to where I wanted to be in the first place, which is tier3 not gang scheduling, while still allowing tier1/tier2 to be requeued/suspended. So I answered my own question, and hopefully someone will benefit from this.

Reed

> On Aug 8, 2022, at 11:27 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>
> I’ve got essentially 3 “tiers” of jobs.
>
> tier1 are stateless and can be requeued
> tier2 are stateful and can be suspended
> tier3 are “high priority” and can preempt tier1 and tier2 with the requisite
> preemption modes.
>
>> $ sacctmgr show qos format=name%10,priority%10,preempt%12,preemptmode%10
>>       Name   Priority      Preempt PreemptMod
>> ---------- ---------- ------------ ----------
>>     normal          0                 cluster
>>      tier1         10                 requeue
>>      tier2         10                 suspend
>>      tier3        100  tier1,tier2    cluster
>
> I also have a separate partition for the same hardware nodes to allow for
> tier3 to cross partitions to suspend tier2 (if it’s possible to have this all
> work in a single partition, please let me know).
>
> tier1 and tier2 get preempted by tier3 perfectly, but the problem is now that
> tier3 gets gang scheduled in times of big queues in tier3, when I never want
> gang scheduling anywhere, but especially not in tier3.
>
>> PreemptType=preempt/qos
>> PreemptMode=SUSPEND,GANG
>
> This is what is in my slurm.conf, because if I try to set
> PreemptMode=SUSPEND, the ctld won’t start due to:
>> slurmctld: error: PreemptMode=SUSPEND requires GANG too
>
> I have also tried to set PreemptMode=OFF in the (tier3) partition as well,
> but this has had no effect on gang scheduling that I can see.
>
> Right now, my hit-it-with-a-hammer solution is increasing SchedulerTimeSlice
> to 65535, which should effectively prevent jobs from gang scheduling.
> While this effectively gets me to the goal I’m looking for, it’s inelegant,
> and if I end up with jobs that go past ~18 hours, this is not going to work
> as I want/hope/expect.
>
> So I’m hoping that there is a better solution to this that would solve the
> root issue and have the tier3 qos/partition not preempt itself.
>
> Hopefully I’ve described this well enough that someone can offer some pointers
> on how to have suspend-able jobs in tier2, without having incidental
> gang-suspension in tier3.
>
> This is 21.08.8-2 in the production cluster, and I’m testing 22.05.2 in my
> testing cluster, which is behaving the same way.
>
> Reed
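P.S. For anyone wanting to reproduce the QOS layout quoted above, it should be creatable with something roughly along these lines (the names and numbers are just the values from my table; treat this as a sketch rather than exactly what I ran):

   sacctmgr add qos tier1 set priority=10 preemptmode=requeue
   sacctmgr add qos tier2 set priority=10 preemptmode=suspend
   sacctmgr add qos tier3 set priority=100 preempt=tier1,tier2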