Hi Sean,

Slurm version: 20.02.6 (via Bright Cluster Manager)
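(For reference on the subject line: the ReqMem / MaxRSS / MaxVMSize figures are the kind reported by an sacct query along these lines, with 12345 as a placeholder job ID:)

  # requested vs. measured memory for a job; 12345 is a placeholder job ID
  sacct -j 12345 --format=JobID,State,ExitCode,ReqMem,MaxRSS,MaxVMSize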
slurm.conf:

ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=UsePss,NoShared

I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job appeared to have left two slurmstepd zombie processes running at 100% CPU each, and changed to:

ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill

Have asked the user to re-run the job, but that has not happened yet.

cgroup.conf:

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
TaskAffinity=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
ConstrainKmemSpace=yes
AllowedRamSpace=100.00
AllowedSwapSpace=0.00
MinKmemSpace=200
MaxKmemPercent=100.00
MemorySwappiness=100
MaxRAMPercent=100.00
MaxSwapPercent=100.00
MinRAMSpace=200

Cheers,
Dave

--
David Chin, PhD (he/him)
Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu    215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sean Crosby <scro...@unimelb.edu.au>
Sent: Monday, March 15, 2021 15:22
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

External.

What are your Slurm settings - what's the values of

ProctrackType
JobAcctGatherType
JobAcctGatherParams

and what's the contents of cgroup.conf?

Also, what version of Slurm are you using?

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

Drexel Internal Data