Angel,
Unless you are using cgroups and constraints, no limit is imposed.
The numbers are used by Slurm to track what is available, not what you
may or may not use. So you could tell Slurm the node only has 1GB and
it will not let you request more than that, but if you do request only
1GB, without specific configuration there is nothing stopping you from
actually using more than that.
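For reference, a minimal sketch of the settings that make Slurm
actually enforce memory requests via cgroups (these are the standard
slurm.conf/cgroup.conf parameters; adjust values for your site):
,----
| # slurm.conf
| SelectType=select/cons_tres
| SelectTypeParameters=CR_Core_Memory
| TaskPlugin=task/cgroup,task/affinity
| ProctrackType=proctrack/cgroup
| JobAcctGatherType=jobacct_gather/cgroup
|
| # cgroup.conf
| ConstrainCores=yes
| ConstrainRAMSpace=yes
| ConstrainSwapSpace=yes
`----
With ConstrainRAMSpace=yes, a step that tries to use more memory than
it requested is confined to its allocation (and typically OOM-killed)
rather than just being tracked.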
So your request did not exceed what Slurm sees as available (1 CPU
using 4GB), and it is happy to let your script run. I suspect that if
you look at the usage, you will see that one CPU spiked high while the
others did nothing.
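One way to check, once accounting data is in (a sketch using the job
id from your example below; the format fields are standard sacct
fields):
,----
| $ sacct -j 133982 --format=JobID,AllocCPUS,MaxRSS,AveCPU,TotalCPU,Elapsed
`----
A TotalCPU close to Elapsed (rather than ~76x Elapsed) would mean
essentially one core did the work.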
Brian Andrus
On 9/4/2024 1:37 AM, Angel de Vicente via slurm-users wrote:
Hello,
we found an issue with Slurm 24.05.1 and the MaxMemPerNode
setting. Slurm is installed on a single workstation, so the
number of nodes is just 1.
The relevant sections in slurm.conf read:
,----
| EnforcePartLimits=ALL
| PartitionName=short Nodes=..... State=UP Default=YES MaxTime=2-00:00:00 MaxCPUsPerNode=76 MaxMemPerNode=231000 OverSubscribe=FORCE:1
`----
Now, if I submit a job requesting 76 CPUs and each one needing 4000M
(for a total of 304000M), Slurm does indeed respect the MaxMemPerNode
setting and the job is not submitted in the following cases ("-N 1" is
not really necessary, as there is only one node):
,----
| $ sbatch -N 1 -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -N 1 -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
|
| $ sbatch -n 1 -c 76 -p short --mem-per-cpu=4000M test.batch
| sbatch: error: Batch job submission failed: Memory required by task is not available
`----
But with this submission Slurm is happy:
,----
| $ sbatch -n 76 -c 1 -p short --mem-per-cpu=4000M test.batch
| Submitted batch job 133982
`----
and the slurmjobcomp.log file does indeed tell me that the memory went
above MaxMemPerNode:
,----
| JobId=133982 UserId=......(10487) GroupId=domain users(2000) Name=test
|   JobState=CANCELLED Partition=short TimeLimit=45 StartTime=2024-09-04T09:11:17
|   EndTime=2024-09-04T09:11:24 NodeList=...... NodeCnt=1 ProcCnt=76 WorkDir=/tmp/.
|   ReservationName= Tres=cpu=76,mem=304000M,node=1,billing=76 Account=ddgroup
|   QOS=domino WcKey= Cluster=...... SubmitTime=2024-09-04T09:11:17
|   EligibleTime=2024-09-04T09:11:17 DerivedExitCode=0:0 ExitCode=0:0
`----
What is the best way to report issues like this to the Slurm
developers? I thought of filing it at https://support.schedmd.com/
but it is not clear to me whether that site is only meant for Slurm
users with a support contract.
Cheers,
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com