“The SchedulerType configuration parameter specifies the scheduler plugin to 
use. Options are sched/backfill, which performs backfill scheduling, and 
sched/builtin, which attempts to schedule jobs in a strict priority order 
within each partition/queue.”

https://slurm.schedmd.com/sched_config.html

If you’re using the builtin scheduler, lower priority jobs have no way to run 
ahead of higher priority jobs. If you’re using the backfill scheduler, your 
jobs will need specific wall times specified, since the idea with backfill is 
to run lower priority jobs ahead of time if and only if they can complete 
without delaying the estimated start time of higher priority jobs.
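For instance, a slurm.conf fragment enabling backfill might look like the following (the SchedulerParameters values here are illustrative assumptions, not recommendations for your site):

```
# slurm.conf -- illustrative fragment, values are assumptions
SchedulerType=sched/backfill
# bf_window: how far ahead (in minutes) backfill plans; should cover
#   your longest allowed wall time
# bf_continue: let the backfill pass resume after releasing locks
SchedulerParameters=bf_window=4320,bf_continue
```

Jobs then need explicit wall times (e.g. "sbatch --time=02:00:00 job.sh") for backfill to estimate start times; alternatively, a DefaultTime on the partition can supply one when users omit it.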

On Jul 13, 2020, at 4:18 AM, navin srivastava <navin.alt...@gmail.com> wrote:

Hi Team,

We have separate partitions for GPU nodes and CPU-only nodes.

Scenario: jobs submitted in our environment request either 4 CPUs + 1 GPU or 4 CPUs only, in the nodeGPUsmall and nodeGPUbig partitions. When all GPUs are exhausted, the remaining GPU jobs wait in the queue for GPU resources. The problem is that jobs requesting only CPUs also stay pending, even though plenty of CPU resources are available, because the GPU jobs have higher priority than the CPU-only jobs.

Is there any option so that when all GPU resources are exhausted, CPU-only jobs are still allowed to run? Is there a way to deal with this, or some custom solution we could consider? There is no issue with the CPU-only partitions.

Below is my Slurm configuration file:


NodeName=node[1-12] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10 
RealMemory=128833 State=UNKNOWN
NodeName=node[13-16] NodeAddr=node[13-16] Sockets=2 CoresPerSocket=10 
RealMemory=515954 Feature=HIGHMEM State=UNKNOWN
NodeName=node[28-32]  NodeAddr=node[28-32] Sockets=2 CoresPerSocket=28 
RealMemory=257389
NodeName=node[32-33]  NodeAddr=node[32-33] Sockets=2 CoresPerSocket=24 
RealMemory=773418
NodeName=node[17-27]  NodeAddr=node[17-27] Sockets=2 CoresPerSocket=18 
RealMemory=257687 Feature=K2200 Gres=gpu:2
NodeName=node[34]  NodeAddr=node34 Sockets=2 CoresPerSocket=24 
RealMemory=773410 Feature=RTX Gres=gpu:8


PartitionName=node Nodes=node[1-10,14-16,28-33,35]  Default=YES 
MaxTime=INFINITE State=UP Shared=YES
PartitionName=nodeGPUsmall Nodes=node[17-27]  Default=NO MaxTime=INFINITE 
State=UP Shared=YES
PartitionName=nodeGPUbig Nodes=node[34]  Default=NO MaxTime=INFINITE State=UP 
Shared=YES

Regards
Navin.
