Hi Dj,

a solution could be to use two QOS. We use something similar to restrict the usage of GPU nodes (MaxTresPU=node=2). The examples below are from our test cluster.
1) Create a QOS with e.g. MaxTresPU=cpu=200 and assign it to your partition.
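Roughly, the commands could look like this (just a sketch; the QOS/partition names are the ones from our test cluster below, and AccountingStorageEnforce in slurm.conf should include 'limits' if I remember correctly):

# create the QOS and give it a per-user CPU limit
sacctmgr add qos maxcpu
sacctmgr modify qos maxcpu set MaxTRESPerUser=cpu=200

# attach it as Partition QOS in slurm.conf, then reread the config:
#   PartitionName=maxtresputest Nodes=... QoS=maxcpu ...
scontrol reconfigure

On our test cluster the QOS and partition then look like this (we use a small limit of cpu=10 there for testing):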
[root@bta0 ~]# sacctmgr -s show qos maxcpu format=Name,MaxTRESPU
      Name     MaxTRESPU
---------- -------------
    maxcpu        cpu=10
[root@bta0 ~]#
[root@bta0 ~]# scontrol show part maxtresputest
PartitionName=maxtresputest
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=maxcpu

If a user submits jobs requesting more CPUs, his new jobs get the reason 'QOSMaxCpuPerUserLimit' in squeue:
kxxxxxx@btlogin1% squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 125316 maxtrespu maxsubmi  kxxxxxx PD       0:00      1 (QOSMaxCpuPerUserLimit)
 125317 maxtrespu maxsubmi  kxxxxxx PD       0:00      1 (QOSMaxCpuPerUserLimit)
 125305 maxtrespu maxsubmi  kxxxxxx  R       0:45      1 btc30
 125306 maxtrespu maxsubmi  kxxxxxx  R       0:45      1 btc30

2) Create a second QOS with Flags=DenyOnLimit,OverPartQOS and MaxTresPU=cpu=400. Assign it to a user who should be allowed to exceed the 200-CPU limit; he will then be limited to 400 instead. That user has to use this QOS when submitting new jobs.
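Roughly (again just a sketch; the QOS name 'overpart' is from our test cluster, the user name is a placeholder):

# create the override QOS with a higher per-user CPU limit
sacctmgr add qos overpart
sacctmgr modify qos overpart set Flags=DenyOnLimit,OverPartQOS MaxTRESPerUser=cpu=400

# give the user access to it (adds 'overpart' to his QOS list)
sacctmgr modify user where name=kxxxxxx set qos+=overpart

# the user must then request it explicitly when submitting
sbatch --qos=overpart --partition=maxtresputest ...

If I remember correctly, OverPartQOS is what lets the limits of this job QOS take precedence over the Partition QOS, and DenyOnLimit rejects a job at submission if it exceeds the QOS limits. On our test cluster the second QOS looks like this (again with a smaller limit for testing):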
[root@bta0 ~]# sacctmgr -s show qos overpart format=Name,Flags%30,MaxTRESPU
      Name                          Flags     MaxTRESPU
---------- ------------------------------ -------------
  overpart        DenyOnLimit,OverPartQOS        cpu=40

Cheers,
Carsten

--
Carsten Beyer
Systems Department

Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a * D-20146 Hamburg * Germany

Phone: +49 40 460094-221
Fax:   +49 40 460094-270
Email: be...@dkrz.de
URL:   http://www.dkrz.de

Managing Director: Prof. Dr. Thomas Ludwig
Registered office: Hamburg
Commercial register: Amtsgericht Hamburg, HRB 39784

On 22.09.2021 at 20:57, Dj Merrill wrote:
Hi all,

I'm relatively new to Slurm and my Internet searches so far have turned up lots of examples from the client perspective, but not from the admin perspective on how to set this up, and I'm hoping someone can point us in the right direction. This should be pretty simple... :-)

We have a test cluster running Slurm 21.08.1 and are trying to figure out how to set a limit of 200 CPU cores that can be requested in a partition. Basically, if someone submits a thousand single-CPU-core jobs, it should run 200 of them and the other 800 should wait in the queue until one is finished, then run their next job from the queue, etc. Or, if someone has a 180 CPU core job running and they submit a 30 CPU core job, it should wait in the queue until the 180 core job finishes. If someone submits a job requesting 201 CPU cores, it should fail and give an error.

According to the Slurm resource limits hierarchy, if a partition limit is set, we should be able to set up a user association to override it in the case where we might want someone to be able to access 300 CPU cores in that partition, for example.

I can see in the Slurm documentation how to set up max nodes per partition, but have not been able to find how to do this with CPU cores.

My questions are:
1) How do we set up a CPU core limit on a partition that applies to all users?
2) How do we set up a user association to allow a single person to use more than the default CPU core limit set on the partition?
3) Is there a better way to accomplish this than the method I'm asking?

For reference, Slurm accounting is set up, GPU allocations are working properly, and I think we are close but just missing something obvious to set up the CPU core limits.

Thank you,

-Dj