Hi Dj,

a solution could be to use two QOS. We use something similar to restrict the usage of GPU nodes (MaxTresPU=node=2). The examples below are from our test cluster.
1) Create a QOS with e.g. MaxTresPU=cpu=200 and assign it to your partition.
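Roughly, the commands could look like this (just a sketch; the QOS/partition names are the ones from our test cluster below, and AccountingStorageEnforce in slurm.conf should include 'limits' if I remember correctly):

# create the QOS and give it a per-user CPU limit
sacctmgr add qos maxcpu
sacctmgr modify qos maxcpu set MaxTRESPerUser=cpu=200

# attach it as Partition QOS in slurm.conf, then reread the config:
#   PartitionName=maxtresputest Nodes=... QoS=maxcpu ...
scontrol reconfigure

On our test cluster the QOS and partition then look like this (we use a small limit of cpu=10 there for testing):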
[root@bta0 ~]# sacctmgr -s show qos maxcpu format=Name,MaxTRESPU
      Name     MaxTRESPU
---------- -------------
    maxcpu        cpu=10
[root@bta0 ~]#
[root@bta0 ~]# scontrol show part maxtresputest
PartitionName=maxtresputest
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=maxcpu

If a user submits jobs requesting more CPUs, his new jobs get the reason 'QOSMaxCpuPerUserLimit' in squeue:
kxxxxxx@btlogin1% squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 125316 maxtrespu maxsubmi  kxxxxxx PD       0:00      1 (QOSMaxCpuPerUserLimit)
 125317 maxtrespu maxsubmi  kxxxxxx PD       0:00      1 (QOSMaxCpuPerUserLimit)
 125305 maxtrespu maxsubmi  kxxxxxx  R       0:45      1 btc30
 125306 maxtrespu maxsubmi  kxxxxxx  R       0:45      1 btc30

2) Create a second QOS with Flags=DenyOnLimit,OverPartQOS and MaxTresPU=cpu=400. Assign it to a user who should be allowed to exceed the 200-CPU limit; he will then be limited to 400 instead. That user has to use this QOS when submitting new jobs.
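Roughly (again just a sketch; the QOS name 'overpart' is from our test cluster, the user name is a placeholder):

# create the override QOS with a higher per-user CPU limit
sacctmgr add qos overpart
sacctmgr modify qos overpart set Flags=DenyOnLimit,OverPartQOS MaxTRESPerUser=cpu=400

# give the user access to it (adds 'overpart' to his QOS list)
sacctmgr modify user where name=kxxxxxx set qos+=overpart

# the user must then request it explicitly when submitting
sbatch --qos=overpart --partition=maxtresputest ...

If I remember correctly, OverPartQOS is what lets the limits of this job QOS take precedence over the Partition QOS, and DenyOnLimit rejects a job at submission if it exceeds the QOS limits. On our test cluster the second QOS looks like this (again with a smaller limit for testing):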
[root@bta0 ~]# sacctmgr -s show qos overpart format=Name,Flags%30,MaxTRESPU
      Name                          Flags     MaxTRESPU
---------- ------------------------------ -------------
  overpart        DenyOnLimit,OverPartQOS        cpu=40

Cheers,
Carsten

--
Carsten Beyer
Systems Department

Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a * D-20146 Hamburg * Germany

Phone: +49 40 460094-221
Fax:   +49 40 460094-270
Email: be...@dkrz.de
URL:   http://www.dkrz.de

Managing Director: Prof. Dr. Thomas Ludwig
Registered office: Hamburg
Commercial register: Amtsgericht Hamburg, HRB 39784

On 22.09.2021 at 20:57, Dj Merrill wrote:
Hi all,

I'm relatively new to Slurm and my Internet searches so far have turned up lots of examples from the client perspective, but not from the admin perspective on how to set this up, and I'm hoping someone can point us in the right direction. This should be pretty simple... :-)

We have a test cluster running Slurm 21.08.1 and are trying to figure out how to set a limit of 200 CPU cores that can be requested in a partition. Basically, if someone submits a thousand single-CPU-core jobs, it should run 200 of them and the other 800 should wait in the queue until one is finished, then run their next job from the queue, etc. Or, if someone has a 180 CPU core job running and they submit a 30 CPU core job, it should wait in the queue until the 180 core job finishes. If someone submits a job requesting 201 CPU cores, it should fail and give an error.

According to the Slurm resource limits hierarchy, if a partition limit is set, we should be able to set up a user association to override it in the case where we might want someone to be able to access 300 CPU cores in that partition, for example.

I can see in the Slurm documentation how to set up max nodes per partition, but have not been able to find how to do this with CPU cores.

My questions are:
1) How do we set up a CPU core limit on a partition that applies to all users?
2) How do we set up a user association to allow a single person to use more than the default CPU core limit set on the partition?
3) Is there a better way to accomplish this than the method I'm asking?

For reference, Slurm accounting is set up, GPU allocations are working properly, and I think we are close but just missing something obvious to set up the CPU core limits.

Thank you,

-Dj