Hello,

we found an issue with Slurm 24.05.1 and the MaxMemPerNode
setting. Slurm is installed on a single workstation, so there is
only one node.

The relevant sections in slurm.conf read:

,----
| EnforcePartLimits=ALL
| PartitionName=short       Nodes=..... State=UP Default=YES MaxTime=2-00:00:00 MaxCPUsPerNode=76  MaxMemPerNode=231000 OverSubscribe=FORCE:1
`----

Now, if I submit a job requesting 76 CPUs with 4000M each (304000M in
total, above the 231000M limit), Slurm does respect the MaxMemPerNode
setting and rejects the job in the following cases ("-N 1" is not
really necessary, as there is only one node):

,----
| $ sbatch -N 1 -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
| 
| $ sbatch -N 1 -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
| 
| $ sbatch -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
`----


But Slurm accepts this submission, which differs from the second one
above only in omitting "-N 1":

,----
| $ sbatch -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| Submitted batch job 133982
`----

and the slurmjobcomp.log file confirms that the job was recorded with
mem=304000M, which is above MaxMemPerNode:

,----
| JobId=133982 UserId=......(10487) GroupId=domain users(2000) Name=test JobState=CANCELLED Partition=short TimeLimit=45 StartTime=2024-09-04T09:11:17 EndTime=2024-09-04T09:11:24 NodeList=...... NodeCnt=1 ProcCnt=76 WorkDir=/tmp/. ReservationName= Tres=cpu=76,mem=304000M,node=1,billing=76 Account=ddgroup QOS=domino WcKey= Cluster=...... SubmitTime=2024-09-04T09:11:17 EligibleTime=2024-09-04T09:11:17 DerivedExitCode=0:0 ExitCode=0:0
`----
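
For reference, the arithmetic can be double-checked with a short
Python sketch. The function names are my own and not part of Slurm,
and the parsing assumes that the mem value in the Tres field is given
in M, as in the record above:

```python
import re

# MaxMemPerNode from the partition definition quoted above (megabytes).
MAX_MEM_PER_NODE_MB = 231000


def requested_mem_mb(ntasks, cpus_per_task, mem_per_cpu_mb):
    """Total memory implied by -n, -c and --mem-per-cpu on a single node."""
    return ntasks * cpus_per_task * mem_per_cpu_mb


def tres_mem_mb(record):
    """Extract mem=<N>M from a job completion record (assumes the M suffix)."""
    match = re.search(r"\bmem=(\d+)M\b", record)
    if match is None:
        raise ValueError("no mem=<N>M entry found in record")
    return int(match.group(1))


# Tres field copied from the slurmjobcomp.log record above.
record = "Tres=cpu=76,mem=304000M,node=1,billing=76"

for ntasks, cpus in [(1, 76), (76, 1)]:
    total = requested_mem_mb(ntasks, cpus, 4000)
    print(f"-n {ntasks} -c {cpus}: {total}M, over limit: {total > MAX_MEM_PER_NODE_MB}")

logged = tres_mem_mb(record)
print(f"logged: {logged}M, over limit: {logged > MAX_MEM_PER_NODE_MB}")
```

Both task layouts ask for the same total, so I would expect all four
submissions to be treated alike.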


What is the best way to report issues like this to the Slurm developers?
I thought of adding it to https://support.schedmd.com/ but it is not
clear to me whether that page is only meant for Slurm users with a
support contract.

Cheers,
-- 
Ángel de Vicente  
 Research Software Engineer (Supercomputing and BigData)
 Instituto de Astrofísica de Canarias (https://www.iac.es/en)


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
