Hi, I think you want one of the "MaxTRESMins*" options:
MaxTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
MaxTRESMinsPJ=TRES=<minutes>[,TRES=<minutes>,...]
MaxTRESMinsPerJob=TRES=<minutes>[,TRES=<minutes>,...]
    Maximum number of TRES minutes each job is able to use in this
    association. This is overridden if set directly on a user. Default
    is the cluster's limit. To clear a previously set value use the
    modify command with a new value of -1 for each TRES id.
- sacctmgr(1)

"MaxCPUs" is a limit on the number of CPUs the association can use (a
count of CPUs, not CPU-minutes).

-- Michael

On Fri, Apr 18, 2025 at 8:01 AM Patrick Begou via slurm-users
<slurm-users@lists.schedmd.com> wrote:
> Hi all,
>
> I'm trying to set up a QoS on a small 5-node cluster running Slurm
> 24.05.7. My goal is to limit resources on a (time x number of cores)
> basis, to avoid one large job claiming all the resources for too long.
> I've read https://slurm.schedmd.com/qos.html and some discussions, but
> my setup is still not working.
>
> I think I need to set:
> MaxCPUsPerJob=172800
> MaxWallDurationPerJob=48:00:00
> Flags=DenyOnLimit,OverPartQOS
>
> for:
> 12 h max for 240 cores => 12*240*60 = 172800 min
> no job can exceed 2 days
> do not accept jobs outside these limits.
>
> What I've done:
>
> 1) Create the QoS:
> sudo sacctmgr add qos workflowlimit \
>     MaxWallDurationPerJob=48:00:00 \
>     MaxCPUsPerJob=172800 \
>     Flags=DenyOnLimit,OverPartQOS
>
> 2) Check:
> sacctmgr show qos Name=workflowlimit format=Name%16,MaxTRES,MaxWall
>             Name       MaxTRES     MaxWall
> ---------------- ------------- -----------
>    workflowlimit    cpu=172800  2-00:00:00
>
> 3) Set the QoS for the account "most", which is the default account
> for the users:
> sudo sacctmgr modify account name=most set qos=workflowlimit
>
> 4) Check:
> $ sacctmgr show assoc format=account,cluster,user,qos
>    Account    Cluster       User                  QOS
> ---------- ---------- ---------- --------------------
>       root     osorno                          normal
>       root     osorno       root               normal
>       legi     osorno                          normal
>       most     osorno                   workflowlimit
>       most     osorno      begou        workflowlimit
>
> 5) Modify slurm.conf with:
> AccountingStorageEnforce=limits,qos
> and propagate it to the 5 nodes and the front end (done via Ansible).
>
> 6) Check:
> clush -b -w osorno-fe,osorno,osorno-0-[0-4] 'grep AccountingStorageEnforce /etc/slurm/slurm.conf'
> ---------------
> osorno,osorno-0-[0-4],osorno-fe (7)
> ---------------
> AccountingStorageEnforce=limits,qos
>
> 7) Restart slurmd on all the compute nodes, and slurmctld + slurmdbd
> on the management node.
>
> But I can still request 400 cores for 24 hours:
> [begou@osorno ~]$ srun -n 400 -t 24:0:0 --pty bash
> bash-5.1$ squeue
> JOBID PARTITION NAME  USER ST TIME          START_TIME TIME_LIMIT CPUS NODELIST(REASON)
>   147     genoa bash begou  R 0:03 2025-04-18T16:52:11 1-00:00:00  400 osorno-0-[0-4]
>
> So I must have missed something?
>
> My partition (I have only one) in slurm.conf is:
> PartitionName=genoa State=UP Default=YES MaxTime=48:00:00
> DefaultTime=24:00:00 Shared=YES OverSubscribe=NO Nodes=osorno-0-[0-4]
>
> Thanks
>
> Patrick
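
For reference, a minimal sketch of Michael's suggestion applied to the QoS
from the thread, reusing its name and values (a hedged example, not a
verified recipe; the -1 clears a previously set TRES value, as in the
sacctmgr(1) excerpt above):

# Clear the per-job CPU *count* cap (the cpu=172800 shown under MaxTRES),
# then set a per-job CPU *minutes* cap instead.
sudo sacctmgr modify qos name=workflowlimit set MaxTRES=cpu=-1
sudo sacctmgr modify qos name=workflowlimit set MaxTRESMins=cpu=172800

# Verify: MaxTRESMins should now show cpu=172800 and MaxTRES should be empty.
sacctmgr show qos name=workflowlimit format=Name%16,MaxTRES,MaxTRESMins,MaxWall

With Flags=DenyOnLimit and AccountingStorageEnforce=limits,qos already in
place, a request such as srun -n 400 -t 24:0:0 (400 * 24 * 60 = 576000
CPU-minutes, well over the 172800 limit) should then be rejected at
submission time.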