Re: [slurm-users] [External] Submitting to multiple paritions problem with gres specified

Bas van der Vlies Tue, 09 Mar 2021 06:12:30 -0800

For those who are interested:
 * https://bugs.schedmd.com/show_bug.cgi?id=11044


On 09/03/2021 14:21, Bas van der Vlies wrote:

I have found the problem and will submit a patch. If we find a partition where a job can run but all nodes are busy, we save this state and return it once all partitions have been checked and the job cannot run in any of them.
I do not know if this is the right approach.
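
The idea described above can be sketched in Python. This is only an illustration of the described logic, not the actual patch and not Slurm's scheduler code: the names `Partition`, `Outcome`, `check_partition` and `submit` are hypothetical, and the model assumes a single-node job that requests one GRES type.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    RUN = "run"                  # resources are free right now
    QUEUED = "queued"            # partition fits the job, but its nodes are busy
    UNAVAILABLE = "unavailable"  # partition can never satisfy the request


@dataclass
class Partition:
    name: str
    gres_types: set   # GRES types offered by the nodes in this partition
    idle_nodes: int   # number of nodes currently free


def check_partition(part, gres_type):
    """Classify one partition for a single-node job requesting gres_type."""
    if gres_type not in part.gres_types:
        return Outcome.UNAVAILABLE
    return Outcome.RUN if part.idle_nodes > 0 else Outcome.QUEUED


def submit(partitions, gres_type):
    """Check every partition and remember a 'usable but busy' result."""
    saw_busy = False
    for part in partitions:
        outcome = check_partition(part, gres_type)
        if outcome is Outcome.RUN:
            return Outcome.RUN
        if outcome is Outcome.QUEUED:
            # Save this state so a later partition cannot overwrite it.
            saw_busy = True
    return Outcome.QUEUED if saw_busy else Outcome.UNAVAILABLE


# Scenario from this thread: the v1 partition is fully occupied,
# and the v2 partition does not offer the requested GRES type.
partitions = [
    Partition("cpu_e5_2650_v1", {"e5_2650_v1"}, idle_nodes=0),
    Partition("cpu_e5_2650_v2", {"e5_2650_v2"}, idle_nodes=2),
]
print(submit(partitions, "e5_2650_v1").value)
```

In this model the job is reported as queued rather than rejected, because the busy result from the first partition is kept even though the last partition in the list cannot satisfy the request.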

regards

On 09/03/2021 09:45, Bas van der Vlies wrote:
Hi Prentice,

Answers inline

On 08/03/2021 22:02, Prentice Bisbal wrote:
Rather than specifying the processor types as GRES, I would recommend defining them as features of the nodes and letting the users specify the features as constraints on their jobs. Since the newer processors are backwards compatible with the older processors, list the older processors as features of the newer nodes, too.
We already do this with features on our other cluster. We assign nodes
different features and users select these. I can add a new feature for the CPU type. Sometimes you want avx512 and a specific processor.
On the other cluster we have 5 different GPUs and a lot of partitions. I want to make it simple for our users, so we have a 'job_submit.lua' script that submits to multiple partitions, and if the user specifies the GRES type then Slurm selects the right partition(s).
On this cluster we do not have GPUs, but I can test with another GRES type,
'cpu_type'. I think the last partition in the list determines the behavior. So if I use a GRES that is supported by the last partition, the job gets queued:
  * srun -N1  --gres=cpu_type:e5_2650_v2 --pty /bin/bash
  * srun --exclusive  --gres=cpu_type:e5_2650_v2 --pty /bin/bash
srun: job 1865 queued and waiting for resources
So to me it seems that one of the partitions is BUSY but can run the job. I will test it on our GPU cluster but expect the same behaviour.
If you want to continue down the road you've already started on, can you provide more information, like the partition definitions and the gres definitions? In general, Slurm should support submitting to multiple partitions.
slurm.conf
PartitionName=cpu_e5_2650_v1 DefMemPerCPU=11000 Default=No DefaultTime=5 DisableRootJobs=YES MaxNodes=2 MaxTime=5-00 Nodes=r16n[18-20] OverSubscribe=EXCLUSIVE QOS=normal State=UP
PartitionName=cpu_e5_2650_v2 DefMemPerCPU=11000 Default=No DefaultTime=5 DisableRootJobs=YES MaxNodes=2 MaxTime=5-00 Nodes=r16n[21-22] OverSubscribe=EXCLUSIVE QOS=normal State=UP
NodeName=r16n18 CoresPerSocket=8 Features=sandybridge,sse4,avx Gres=cpu_type:e5_2650_v1:no_consume:4T MemSpecLimit=1024 NodeHostname=r16n18.mona.surfsara.nl RealMemory=188000 Sockets=2 State=UNKNOWN ThreadsPerCore=1 Weight=10
NodeName=r16n21 CoresPerSocket=8 Features=sandybridge,sse4,avx Gres=cpu_type:e5_2650_v2:no_consume:4T MemSpecLimit=1024 NodeHostname=r16n21.mona.surfsara.nl RealMemory=188000 Sockets=2 State=UNKNOWN ThreadsPerCore=1 Weight=10
gres.conf
NodeName=r16n[18-20] Count=4T Flags=CountOnly Name=cpu_type Type=e5_2650_v1
NodeName=r16n[21-22] Count=4T Flags=CountOnly Name=cpu_type Type=e5_2650_v2
Prentice

On 3/8/21 11:29 AM, Bas van der Vlies wrote:
Hi,
On this cluster I have version 20.02.6 installed. We have different partitions for CPU types and GPU types. We want to make it easy for users who do not care where their job runs, while experienced users can specify the GRES type: cpu_type or gpu.
I have defined 2 cpu partitions:
 * cpu_e5_2650_v1
 * cpu_e5_2650_v2

and 2 gres cpu_type:
 * e5_2650_v1
 * e5_2650_v2


When no partitions are specified it will submit to both partitions:
* srun --exclusive --gres=cpu_type:e5_2650_v1 --pty /bin/bash --> r16n18, which has this gres defined and is in partition cpu_e5_2650_v1
Now I submit another job at the same time:
 * srun --exclusive  --gres=cpu_type:e5_2650_v1  --pty /bin/bash
This fails with: `srun: error: Unable to allocate resources: Requested node configuration is not available`
I would expect it gets queued in the partition `cpu_e5_2650_v1`.


When I specify the partition on the command line:
* srun --exclusive -p cpu_e5_2650_v1_shared --gres=cpu_type:e5_2650_v1 --pty /bin/bash
srun: job 1856 queued and waiting for resources
So the question is: can Slurm handle submitting to multiple partitions when we specify gres attributes?
Regards


--
Bas van der Vlies

| Science Park 140 | 1098 XG Amsterdam | Phone: +31208001300 |
|  bas.vandervl...@surf.nl
