Many thanks Rodrigo and Daniel. Indeed, I had misunderstood that part of Slurm, so thanks for clarifying it; now it makes a lot of sense. Regarding the approach, I went with cgroup.conf as you both suggested. I will start running some synthetic tests to make sure a job gets killed once it exceeds its requested memory. Many thanks again.
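
In case it is useful to others reading this thread, the cgroup.conf I am testing looks roughly like the sketch below. Only ConstrainRAMSpace=yes comes directly from the advice above; the other lines are my own assumptions and may need tuning for our node, and slurmd has to be restarted after changing the file.

###
# cgroup.conf (sketch)
###
CgroupAutomount=yes
ConstrainCores=yes
# Enforce the memory requested by the job via the cgroup memory controller
ConstrainRAMSpace=yes
# Assumption: also disallow swapping beyond the allocation
ConstrainSwapSpace=yes
AllowedSwapSpace=0
# Assumption: keep device (GPU) access constrained to what was allocated
ConstrainDevices=yes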
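
For the synthetic test, the plan is to submit something along these lines (again a sketch; it assumes stress-ng is installed on the node, but any program that allocates well beyond the requested --mem would do):

#!/bin/bash
#SBATCH --job-name=oom-test
#SBATCH --partition=gpu
#SBATCH --mem=4G
#SBATCH --time=00:10:00

# Deliberately allocate about twice the requested memory; with
# ConstrainRAMSpace=yes the cgroup OOM killer should terminate the step.
stress-ng --vm 1 --vm-bytes 8G --vm-keep --timeout 300s

Afterwards, sacct -j <jobid> -o JobID,State,MaxRSS,ReqMem should show the step ending in OUT_OF_MEMORY instead of running to completion.
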
On Fri, Jan 13, 2023 at 3:49 AM Daniel Letai <d...@letai.org.il> wrote:

> Hello Cristóbal,
>
> I think you might have a slight misunderstanding of how Slurm works, which
> can cause this difference in expectation.
>
> The MaxMemPerNode is there to allow the scheduler to plan job placement
> according to resources. It does not enforce limitations during job
> execution, only placement, with the assumption that the job will not use
> more than the resources it requested.
>
> One option to limit the job during execution is through cgroups; another
> might be using *JobAcctGatherParams/OverMemoryKill*, but I would suspect
> cgroups would indeed be the better option for your use case. See the
> slurm.conf man page:
>
> Kill processes that are being detected to use more memory than requested
> by steps every time accounting information is gathered by the JobAcctGather
> plugin. This parameter should be used with caution because a job exceeding
> its memory allocation may affect other processes and/or machine health.
>
> *NOTE*: If available, it is recommended to limit memory by enabling
> task/cgroup as a TaskPlugin and making use of ConstrainRAMSpace=yes in the
> cgroup.conf instead of using this JobAcctGather mechanism for memory
> enforcement. Using JobAcctGather is polling based and there is a delay
> before a job is killed, which could lead to system Out of Memory events.
>
> *NOTE*: When using *OverMemoryKill*, if the combined memory used by all
> the processes in a step exceeds the memory limit, the entire step will be
> killed/cancelled by the JobAcctGather plugin. This differs from the
> behavior when using *ConstrainRAMSpace*, where processes in the step will
> be killed, but the step will be left active, possibly with other processes
> left running.
>
> On 12/01/2023 03:47:53, Cristóbal Navarro wrote:
>
> Hi Slurm community,
> Recently we found a small problem triggered by one of our jobs. We have
> *MaxMemPerNode*=532000 set for our compute node in the slurm.conf file;
> however, a job that started with mem=65536 was able, after hours of
> execution, to grow its memory usage up to ~650GB. We expected that
> *MaxMemPerNode* would stop any job exceeding the limit of 532000; did we
> miss something in the slurm.conf file? We were trying to avoid setting up
> a QOS for each group of users.
> Any help is welcome.
>
> Here is the node definition in the conf file:
>
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu
>
> And here is the full slurm.conf file:
>
> # node health check
> HealthCheckProgram=/usr/sbin/nhc
> HealthCheckInterval=300
>
> ## Timeouts
> SlurmctldTimeout=600
> SlurmdTimeout=600
>
> GresTypes=gpu
> AccountingStorageTRES=gres/gpu
> DebugFlags=CPU_Bind,gres
>
> ## We don't want a node to go back in pool without sys admin acknowledgement
> ReturnToService=0
>
> ## Basic scheduling
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SchedulerType=sched/backfill
>
> ## Accounting
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStoreJobComment=YES
> AccountingStorageHost=10.10.0.1
> AccountingStorageEnforce=limits
>
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/linux
>
> TaskPlugin=task/cgroup
> ProctrackType=proctrack/cgroup
>
> ## scripts
> Epilog=/etc/slurm/epilog
> Prolog=/etc/slurm/prolog
> PrologFlags=Alloc
>
> ## MPI
> MpiDefault=pmi2
>
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 MemSpecLimit=65556 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu
>
> ## Partitions list
> PartitionName=gpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=65556 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxMemPerNode=532000 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01 Default=YES
> PartitionName=cpu OverSubscribe=No MaxCPUsPerNode=64 DefMemPerNode=16384 MaxMemPerNode=420000 MaxTime=3-12:00:00 State=UP Nodes=nodeGPU01
>
> --
> Cristóbal A. Navarro
>
> --
> Regards,
>
> Daniel Letai
> +972 (0)505 870 456

--
Cristóbal A. Navarro