Hi Sean,

Slurm version: 20.02.6 (via Bright Cluster Manager)
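(For reference on the subject line: the ReqMem / MaxRSS / MaxVMSize figures are the kind reported by an sacct query along these lines, with 12345 as a placeholder job ID:)

  # requested vs. measured memory for a job; 12345 is a placeholder job ID
  sacct -j 12345 --format=JobID,State,ExitCode,ReqMem,MaxRSS,MaxVMSize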
slurm.conf:

ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=UsePss,NoShared

I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job appeared to have left two slurmstepd zombie processes running at 100% CPU each, and changed to:

ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill

Have asked the user to re-run the job, but that has not happened yet.

cgroup.conf:

CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
TaskAffinity=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
ConstrainKmemSpace=yes
AllowedRamSpace=100.00
AllowedSwapSpace=0.00
MinKmemSpace=200
MaxKmemPercent=100.00
MemorySwappiness=100
MaxRAMPercent=100.00
MaxSwapPercent=100.00
MinRAMSpace=200

Cheers,
Dave

--
David Chin, PhD (he/him)
Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu    215.571.4335 (o)
For URCF support: urcf-supp...@drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Sean Crosby <scro...@unimelb.edu.au>
Sent: Monday, March 15, 2021 15:22
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

External.

What are your Slurm settings - what's the values of

ProctrackType
JobAcctGatherType
JobAcctGatherParams

and what's the contents of cgroup.conf?

Also, what version of Slurm are you using?

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

Drexel Internal Data