Hi everyone!


I have read the Slurm documentation about QOS, resource limits, scheduling and 
priority multiple times now and even looked into the Slurm source, but I'm still 
not sure I understand everything correctly, which is why I decided to ask here ...



The problem: we see that larger jobs (e.g. needing 16 GPUs) in our (small) GPU 
queue sometimes get delayed and pushed back without any reason that is apparent 
to us, while small jobs that only use e.g. 1 or 2 GPUs get scheduled much more 
quickly even though they have a runtime of 3 days ...



What we want to do:


- We have a number of nodes with 2x gpus that are usable by the users of our 
cluster

- Some of these nodes belong to so-called 'private projects'. Private projects 
have higher priority than other projects. Attached to each private project is a 
contingent of nodes plus a number of "guaranteed" nodes, e.g. a contingent of 4 
nodes (8 GPUs) and 2 "guaranteed" nodes (4 GPUs)

- Guaranteed nodes are nodes that should always be kept idle for the private 
project, so users of the private project can immediately schedule work on those 
nodes

- The other nodes are in general shared with other projects as long as they are 
not "in use"


How we are currently doing this (it has history):

Let's assume we have 50 nodes and 100 GPUs.

- We have a single partition for all GPU nodes (e.g. 50 nodes)
- Private projects have private queues (QOS) with a very high priority and a 
GRES limit equal to the number of GPUs they reserved (e.g. 10 nodes -> 20 GPUs)
- Normal projects only have access to the public queue and schedule work there.
- This public queue has an upper GRES limit of "total number of GPUs" minus the 
"guaranteed GPUs of all private projects" (e.g. 50 - 10 nodes -> 40 nodes -> 80 
GPUs). A rough sketch of this setup follows below.
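
For reference, this is roughly how the limits are set up via QOS. The names and 
numbers below are simplified placeholders, not our actual configuration:

# slurm.conf: one partition over all GPU nodes
PartitionName=gpu Nodes=gpunode[01-50] State=UP

# one QOS per private project: very high priority, GrpTRES capped at the
# GPUs reserved by that project
sacctmgr add qos priv_projA
sacctmgr modify qos priv_projA set Priority=10000 GrpTRES=gres/gpu=20

# public QOS: GrpTRES = total GPUs minus the guaranteed GPUs of all
# private projects
sacctmgr add qos gpu_public
sacctmgr modify qos gpu_public set Priority=100 GrpTRES=gres/gpu=80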


Regarding the scheduler, we currently use the following settings:

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CORE_MEMORY,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
SchedulerParameters=defer,max_sched_time=4,default_queue_depth=1000,partition_job_depth=500,enable_user_top,bf_max_job_user=20,bf_interval=120,bf_window=4320,bf_resolution=1800,bf_continue

Partition/queue depth is deliberately set high at the moment to avoid problems 
with jobs not even being examined.


The problem in more detail:

One of the last jobs we diagnosed (16 GPUs needed) had an approximate start time 
that was beyond all end times of running/scheduled jobs: jobs ending on Feb 22 
would have released more than enough GPUs for it to be scheduled immediately 
afterwards, but its estimated start time was still Feb 23. Priority-wise the job 
had the highest priority of all pending jobs for the partition.
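
(For reference, we checked the estimated start time and the priority with 
roughly the following commands; 2796696 is the job id from the log excerpt 
further below:)

squeue --start -j 2796696    # expected start time of the pending job
scontrol show job 2796696    # StartTime, Reason, requested TRES
sprio -j 2796696             # priority compared to other pending jobs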

When we turned on scheduler debugging and increased log levels, we observed the 
following messages for this job:

JobId=xxxx being held, if allowed the job request will exceed QOS xxxxx group 
max tres(gres/gpu) limit yy with already used yy + requested 16

followed by

sched: JobId=2796696 delayed for accounting policy

So to us this meant that the scheduler was constantly hitting the QOS limit, 
which makes sense because usage in the GPU queue is always very high, and that 
is why the job wasn't being scheduled ...
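
(To confirm this we compared the configured limit with the live usage, roughly 
like this; "gpu_public" is again just a placeholder QOS name:)

sacctmgr show qos gpu_public format=Name,Priority,GrpTRES  # configured limit
scontrol show assoc_mgr flags=qos qos=gpu_public           # limit plus currently used amount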

At first we were worried that this meant "held"/"delayed" jobs like this would 
never actually get scheduled when contention is high enough, e.g. when small 
jobs keep getting backfilled in and the QOS limits therefore stay at their 
maximum for a long time.

But for some reason we could not determine, the job eventually got scheduled at 
some point and then ran at the scheduled start time.


Open Questions:
- Why couldn't the job be scheduled in the first place? Initially we thought 
(from the source code I looked into) that "delayed for accounting policy" 
prevents further scheduling in general, but since the job was eventually 
scheduled, this assumption must be wrong?
- Why was it scheduled at some point? When it was scheduled, contention was 
still high and the QOS limits definitely still applied.
- How could we modify the current setup so that the scheduling of larger jobs 
becomes "better" and more reproducible/explainable?


Apart from all of this, I'm also asking myself whether there is maybe a better 
way to set up a system that works the way we want?


This got a bit long, but I hope it's clear enough :)


Kind regards,
Katrin




-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com