Not trying to argue unnecessarily, but what you describe is not a universal rule, regardless of QOS.
Our GPU nodes are members of 3 GPU-related partitions, 2 more resource-limited non-GPU partitions, and one of two larger-memory partitions. It's set up this way to minimize idle resources (we didn't buy enough GPUs in those nodes to keep all the CPUs busy, and our other nodes have limited numbers of DIMM slots for larger-memory jobs).

First terminal, resulting in a job running in the ‘any-interactive’ partition on gpunode002 (we have a job submit plugin that automatically routes jobs to ‘interactive’, ‘gpu-interactive’, or ‘any-interactive’ depending on the resources requested; a rough sketch of that routing logic follows at the end of this message):

=====
[renfro@login rosetta-job]$ type hpcshell
hpcshell is a function
hpcshell ()
{
    srun --partition=interactive $@ --pty bash -i
}
[renfro@login rosetta-job]$ hpcshell
[renfro@gpunode002(job 751070) rosetta-job]$
=====

Second terminal, simultaneous to the first, resulting in a job running in the ‘gpu-interactive’ partition on the same gpunode002:

=====
[renfro@login ~]$ hpcshell --gres=gpu
[renfro@gpunode002(job 751071) ~]$ squeue -t R -u $USER
 JOBID PARTI NAME  USER   ST TIME S:C: NODES MIN_MEMORY NODELIST(REASON) SUBMIT_TIME         START_TIME          END_TIME            TRES_PER_NODE
751071 gpu-i bash  renfro R  0:08 *:*: 1     2000M      gpunode002       2020-06-16T08:27:50 2020-06-16T08:27:50 2020-06-16T10:27:50 gpu
751070 any-i bash  renfro R  0:18 *:*: 1     2000M      gpunode002       2020-06-16T08:27:40 2020-06-16T08:27:40 2020-06-16T10:27:41 N/A
[renfro@gpunode002(job 751071) ~]$
=====

Selected configuration details (excluding things like resource ranges and defaults):

NodeName=gpunode[001-003] CoresPerSocket=14 RealMemory=382000 Sockets=2 ThreadsPerCore=1 Weight=10011 Gres=gpu:2
NodeName=gpunode004 CoresPerSocket=14 RealMemory=894000 Sockets=2 ThreadsPerCore=1 Weight=10021 Gres=gpu:2
PartitionName=gpu Default=NO MaxCPUsPerNode=16 ExclusiveUser=NO State=UP Nodes=gpunode[001-004]
PartitionName=gpu-debug Default=NO MaxCPUsPerNode=16 ExclusiveUser=NO State=UP Nodes=gpunode[001-004]
PartitionName=gpu-interactive Default=NO MaxCPUsPerNode=16 ExclusiveUser=NO State=UP Nodes=gpunode[001-004]
PartitionName=any-interactive Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=node[001-040],gpunode[001-004]
PartitionName=any-debug Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=node[001-040],gpunode[001-004]
PartitionName=bigmem Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=gpunode[001-003]
PartitionName=hugemem Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=gpunode004

> On Jun 16, 2020, at 8:14 AM, Diego Zuccato <diego.zucc...@unibo.it> wrote:
>
> On 16/06/20 09:39, Loris Bennett wrote:
>
>>> Maybe it's already known and obvious, but... Remember that a node can be
>>> allocated to only one partition.
>> Maybe I am misunderstanding you, but I think that this is not the case.
>> A node can be in multiple partitions.
> *Assigned* to multiple partitions: OK.
> But once slurm schedules a job in "partGPU" on that node, the whole node
> is unavailable for jobs in "partCPU", even if the GPU job is using only
> 1% of the resources.
>
>> We have nodes belonging to
>> individual research groups which are in both a separate partition just
>> for the group and in a 'scavenger' partition for everyone (but with
>> lower priority and maximum run-time).
> More or less our current config. Quite inefficient, at least for us: too
> many unusable resources due to small jobs.
>
>>> So, if you have the mixed nodes in both partitions and there's a GPU
>>> job running, a non-gpu job will find that node marked as busy because
>>> it's allocated to another partition. That's why we're drastically
>>> reducing the number of partitions we have and will avoid shared nodes.
>> Again, I don't think this is the explanation. If a job is running on a
>> GPU node, but not using all the CPUs, then a CPU-only job should be able
>> to start on that node, unless some form of exclusivity has been set up,
>> such as ExclusiveUser=YES for the partition.
> Nope. The whole node gets allocated to one partition at a time. So if
> the GPU job and the CPU one are in different partitions, it's expected
> that only one starts. The behaviour you're looking for is the one of
> QoS: define a single partition w/ multiple QoS and both jobs will run
> concurrently.
>
> If you think about it, that's the meaning of "partition" :)
>
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
>
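For reference, here's the routing sketch I mentioned above. This is not our actual job submit plugin (that runs inside slurmctld); it's just a hypothetical shell wrapper, with made-up option matching, showing the kind of decision it makes for interactive jobs:

=====
#!/bin/bash
# Hypothetical sketch only, not our real job_submit plugin: route an
# interactive job to the GPU partition when a GPU is requested; otherwise
# send it to the general pool (the real plugin picks between 'interactive'
# and 'any-interactive' based on the other resources requested).
partition=any-interactive
for arg in "$@"; do
    case "$arg" in
        --gres=gpu*|--gpus=*)
            # GPU requested, so use the GPU partition on gpunode[001-004].
            partition=gpu-interactive
            ;;
    esac
done
exec srun --partition="$partition" "$@" --pty bash -i
=====

The real routing happens server-side, which is why the plain ‘hpcshell’ job above ended up in ‘any-interactive’ even though the function asks for ‘interactive’. The point is just that nothing in this setup prevents jobs from different partitions from sharing gpunode002, as the two terminals show.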