Hi

I think you want one of the "MaxTRESMins*" options:

MaxTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
MaxTRESMinsPJ=TRES=<minutes>[,TRES=<minutes>,...]
MaxTRESMinsPerJob=TRES=<minutes>[,TRES=<minutes>,...]
Maximum number of TRES minutes each job is able to use in this association.
This is overridden if set directly on a user. Default is the cluster's
limit. To clear a previously set value use the modify command with a new
value of -1 for each TRES id.

   - sacctmgr(1)

The "MaxCPUs" options are a limit on the number of CPUs, not on CPU-minutes, so MaxCPUsPerJob=172800 only rejects a job that asks for more than 172800 CPUs; your 400-core, 24-hour request stays well under that.
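Something like this should do it (untested sketch, reusing your QoS name; it
swaps the per-job CPU cap for a per-job CPU-minutes cap and clears the old
one with -1, as the man page describes):

sudo sacctmgr modify qos workflowlimit set \
     MaxTRESMinsPerJob=cpu=172800 \
     MaxTRESPerJob=cpu=-1

sacctmgr show qos Name=workflowlimit \
     format=Name%16,MaxTRES,MaxTRESMins,MaxWall

With DenyOnLimit and AccountingStorageEnforce=limits,qos in place, a
400-core, 24-hour request (400 * 1440 = 576000 CPU-minutes) should then be
rejected at submission.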

 -- Michael


On Fri, Apr 18, 2025 at 8:01 AM Patrick Begou via slurm-users <slurm-users@lists.schedmd.com> wrote:

> Hi all,
>
> I'm trying to set up a QoS on a small 5-node cluster running Slurm
> 24.05.7. My goal is to limit resources with a (time x number of cores)
> strategy, so that one large job cannot hold all the resources for too
> long. I've read https://slurm.schedmd.com/qos.html and some discussions,
> but my setup is still not working.
>
> I think I need to set the following:
> MaxCPUsPerJob=172800
> MaxWallDurationPerJob=48:00:00
> Flags=DenyOnLimit,OverPartQOS
>
> to get:
> 12h max for 240 cores => (12 * 240 * 60 = 172800 CPU-minutes)
> no job can exceed 2 days
> do not accept jobs outside these limits.
>
> What I've done:
>
> 1) create the QoS:
> sudo sacctmgr add qos workflowlimit \
>       MaxWallDurationPerJob=48:00:00 \
>       MaxCPUsPerJob=172800 \
>       Flags=DenyOnLimit,OverPartQOS
>
>
> 2) Check
> sacctmgr show qos Name=workflowlimit format=Name%16,MaxTRES,MaxWall
>                 Name       MaxTRES     MaxWall
>     ---------------- ------------- -----------
>        workflowlimit    cpu=172800  2-00:00:00
>
> 3) Set the QoS for the account "most", which is the default account for
> the users:
> sudo sacctmgr modify account name=most set qos=workflowlimit
>
> 4) Check
> $ sacctmgr show assoc format=account,cluster,user,qos
>     Account    Cluster       User                  QOS
> ---------- ---------- ---------- --------------------
>        root     osorno                          normal
>        root     osorno       root               normal
>        legi     osorno                          normal
>        most     osorno                   workflowlimit
>        most     osorno      begou        workflowlimit
>
> 5) Modify slurm.conf with:
>      AccountingStorageEnforce=limits,qos
> and propagate it to the 5 nodes and the front end (done via Ansible)
>
> 6) Check
> clush -b -w osorno-fe,osorno,osorno-0-[0-4] 'grep AccountingStorageEnforce /etc/slurm/slurm.conf'
> ---------------
> osorno,osorno-0-[0-4],osorno-fe (7)
> ---------------
> AccountingStorageEnforce=limits,qos
>
> 7) restart slurmd on all the compute nodes and slurmctld + slurmdbd on
> the management node.
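> For reference, that was roughly (assuming systemd units named slurmd,
> slurmctld and slurmdbd):
>
> clush -w osorno-0-[0-4] 'systemctl restart slurmd'
> systemctl restart slurmctld slurmdbd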
>
> But I can still request 400 cores for 24 hours:
> [begou@osorno ~]$ srun -n 400 -t 24:0:0 --pty bash
> bash-5.1$ squeue
>   JOBID PARTITION NAME  USER  ST TIME START_TIME          TIME_LIMIT CPUS NODELIST(REASON)
>     147     genoa bash  begou  R 0:03 2025-04-18T16:52:11 1-00:00:00  400 osorno-0-[0-4]
>
> So I must have missed something?
>
> My partition (I've only one) in slurm.conf is:
> PartitionName=genoa  State=UP Default=YES MaxTime=48:00:00
> DefaultTime=24:00:00 Shared=YES OverSubscribe=NO Nodes=osorno-0-[0-4]
>
> Thanks
>
> Patrick
>
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
