Angel,
Unless you are using cgroups and constraints, no limit is imposed.
The numbers are used by Slurm to track what is available, not what you
may or may not use. So you could tell Slurm the node only has 1GB and
it will not let you request more than that, but if you do request only
1GB, without specific configuration there is nothing stopping you from
actually using more than that.
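For reference, a minimal sketch of the settings that make Slurm
actually enforce memory requests via cgroups (these are the standard
slurm.conf/cgroup.conf parameters; adjust values for your site):
,----
| # slurm.conf
| SelectType=select/cons_tres
| SelectTypeParameters=CR_Core_Memory
| TaskPlugin=task/cgroup,task/affinity
| ProctrackType=proctrack/cgroup
| JobAcctGatherType=jobacct_gather/cgroup
|
| # cgroup.conf
| ConstrainCores=yes
| ConstrainRAMSpace=yes
| ConstrainSwapSpace=yes
`----
With ConstrainRAMSpace=yes, a step that tries to use more memory than
it requested is confined to its allocation (and typically OOM-killed)
rather than just being tracked.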
So your request did not exceed what Slurm sees as available (1 CPU
using 4GB), and it is happy to let your script run. I suspect that if
you look at the usage, you will see that one CPU spiked high while the
others did nothing.
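One way to check, once accounting data is in (a sketch using the job
id from your example below; the format fields are standard sacct
fields):
,----
| $ sacct -j 133982 --format=JobID,AllocCPUS,MaxRSS,AveCPU,TotalCPU,Elapsed
`----
A TotalCPU close to Elapsed (rather than ~76x Elapsed) would mean
essentially one core did the work.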
Brian Andrus
On 9/4/2024 1:37 AM, Angel de Vicente via slurm-users wrote:
Hello,
we found an issue with Slurm 24.05.1 and the MaxMemPerNode
setting. Slurm is installed on a single workstation, so the
number of nodes is just 1.
The relevant sections in slurm.conf read:
,----
| EnforcePartLimits=ALL
| PartitionName=short Nodes=..... State=UP Default=YES MaxTime=2-00:00:00 MaxCPUsPerNode=76 MaxMemPerNode=231000 OverSubscribe=FORCE:1
`----
Now, if I submit a job requesting 76 CPUs and each one needing 4000M
(for a total of 304000M), Slurm does indeed respect the MaxMemPerNode
setting and the job is not submitted in the following cases ("-N 1" is
not really necessary, as there is only one node):
,----
| $ sbatch -N 1 -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -N 1 -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
`----
But with this submission Slurm is happy:
,----
| $ sbatch -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| Submitted batch job 133982
`----
and the slurmjobcomp.log file does indeed tell me that the memory went
above MaxMemPerNode:
,----
| JobId=133982 UserId=......(10487) GroupId=domain users(2000) Name=test
|   JobState=CANCELLED Partition=short TimeLimit=45 StartTime=2024-09-04T09:11:17
|   EndTime=2024-09-04T09:11:24 NodeList=...... NodeCnt=1 ProcCnt=76 WorkDir=/tmp/.
|   ReservationName= Tres=cpu=76,mem=304000M,node=1,billing=76 Account=ddgroup
|   QOS=domino WcKey= Cluster=...... SubmitTime=2024-09-04T09:11:17
|   EligibleTime=2024-09-04T09:11:17 DerivedExitCode=0:0 ExitCode=0:0
`----
What is the best way to report issues like this to the Slurm
developers? I thought of filing it at https://support.schedmd.com/
but it is not clear to me whether that site is only meant for Slurm
users with a support contract.
Cheers,
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com