Hello Cristóbal,
I think you might have a slight misunderstanding of how Slurm
works, which would explain the difference between what you expected and what you saw.
MaxMemPerNode is there to let the scheduler plan job placement
according to available resources. It does not enforce limits
during job execution; it only governs placement, on the assumption
that the job will not use more than the resources it requested.
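To make the distinction concrete (illustrative only, using the numbers from your configuration quoted below; job.sh is just a placeholder script name):

    sbatch --mem=600000 job.sh   # rejected at submit time or left pending (depending on
                                 # EnforcePartLimits): it exceeds MaxMemPerNode=532000
    sbatch --mem=65536 job.sh    # placed; if it later grows to ~650GB nothing stops it,
                                 # because actual usage is not checked against the request by default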
One option for limiting a job during execution is cgroups;
another would be JobAcctGatherParams=OverMemoryKill, but I suspect
cgroups is the better option for your use case (a short
configuration sketch follows the excerpt below). From the slurm.conf man page:
Kill processes that are being detected to use more memory than requested by steps every time accounting information is gathered by the JobAcctGather plugin. This parameter should be used with caution because a job exceeding its memory allocation may affect other processes and/or machine health.

NOTE: If available, it is recommended to limit memory by enabling task/cgroup as a TaskPlugin and making use of ConstrainRAMSpace=yes in the cgroup.conf instead of using this JobAcctGather mechanism for memory enforcement. Using JobAcctGather is polling based and there is a delay before a job is killed, which could lead to system Out of Memory events.

NOTE: When using OverMemoryKill, if the combined memory used by all the processes in a step exceeds the memory limit, the entire step will be killed/cancelled by the JobAcctGather plugin. This differs from the behavior when using ConstrainRAMSpace, where processes in the step will be killed, but the step will be left active, possibly with other processes left running.
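To put that into concrete configuration terms (just a sketch, not tested on your system): since your slurm.conf quoted below already has TaskPlugin=task/cgroup and ProctrackType=proctrack/cgroup, the cgroup route should mostly be a cgroup.conf change, and slurmd on the compute nodes typically needs a restart to pick it up.

    # cgroup.conf -- kernel-enforced memory limits (the recommended route)
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes    # optional: also cap swap usage

    # slurm.conf -- the polling-based alternative described in the excerpt above
    JobAcctGatherParams=OverMemoryKill

With ConstrainRAMSpace=yes a step that tries to exceed its request hits the cgroup limit immediately, whereas OverMemoryKill only reacts at each JobAcctGatherFrequency interval (30s in your config).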
On 12/01/2023 03:47:53, Cristóbal Navarro wrote:
Hi Slurm community,
Recently we ran into a small problem triggered by one of our
jobs. We have MaxMemPerNode=532000 set for our compute node in the
slurm.conf file; however, we found that a job that started with
mem=65536 was able, after hours of execution, to grow its memory
usage to ~650GB. We expected MaxMemPerNode to stop any job
exceeding the 532000 limit. Did we miss something in the
slurm.conf file? We were trying to avoid setting up a QOS for each
group of users.
Any help is welcome.
Here is the node definition in the conf file:
## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu
And here is the full slurm.conf file:
# node health check
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300
## Timeouts
SlurmctldTimeout=600
SlurmdTimeout=600
GresTypes=gpu
AccountingStorageTRES=gres/gpu
DebugFlags=CPU_Bind,gres
## We don't want a node to go back in pool without sys admin acknowledgement
ReturnToService=0
## Basic scheduling
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
SchedulerType=sched/backfill
## Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
AccountingStorageHost=10.10.0.1
AccountingStorageEnforce=limits
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup
## scripts
Epilog=/etc/slurm/epilog
Prolog=/etc/slurm/prolog
PrologFlags=Alloc
## MPI
MpiDefault=pmi2
## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu
## Partitions list
PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01 Default=YES
PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384 MaxMemPerNode=420000 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01
--
Regards,
Daniel Letai
+972 (0)505 870 456