usage_in_bytes is not actually usage in bytes, by the way. It's often
close but I have seen wildly different values. See
https://lkml.org/lkml/2011/3/28/93 and
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section
5.5. memory.stat is what you want for accurate data.
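(For anyone who wants to poke at this: below is a minimal, untested sketch of what I mean - parse total_rss out of memory.stat rather than trusting memory.usage_in_bytes. The cgroup path is only an example; adjust it for your own hierarchy.)

/* Sketch only, not Slurm code: read total_rss from memory.stat
 * instead of memory.usage_in_bytes. Path below is an example. */
#include <stdio.h>
#include <string.h>

static long long read_total_rss(const char *stat_path)
{
    FILE *fp = fopen(stat_path, "r");
    char key[64];
    long long val, rss = -1;

    if (!fp)
        return -1;
    while (fscanf(fp, "%63s %lld", key, &val) == 2) {
        if (strcmp(key, "total_rss") == 0) {
            rss = val;
            break;
        }
    }
    fclose(fp);
    return rss;
}

int main(void)
{
    long long rss = read_total_rss(
        "/sys/fs/cgroup/memory/slurm/memory.stat");  /* example path */
    if (rss >= 0)
        printf("total_rss: %lld bytes\n", rss);
    return 0;
}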
I wrote the code you referenced below. Now that I know more about
failcnt, it does have some corner cases that aren't ideal. If I were to
start over I would use cgroup.event_control to get OOM events, such as
in
https://github.com/BYUHPC/uft/blob/master/oom_notifierd/oom_notifierd.c
or https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section
9. At the time I didn't really feel like learning how to add and clean
up a thread or something similar that would listen for those events.
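(For reference, the event_control mechanism from section 9 of that
document boils down to something like the sketch below - example paths,
minimal error handling, and not the code from oom_notifierd.c, just the
same idea.)

/* Sketch of the cgroup-v1 OOM notification mechanism from memory.txt
 * section 9. Paths are examples only; error handling is minimal. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    const char *cg = "/sys/fs/cgroup/memory/slurm/uid_1000/job_42"; /* example */
    char path[256], line[64];
    int efd, ofd, ecfd;
    uint64_t count;

    efd = eventfd(0, 0);                        /* event counter fd */

    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    ofd = open(path, O_RDONLY);                 /* fd to watch      */

    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    ecfd = open(path, O_WRONLY);

    /* Register the pair: "<eventfd> <oom_control fd>" */
    snprintf(line, sizeof(line), "%d %d", efd, ofd);
    if (write(ecfd, line, strlen(line)) < 0) {
        perror("cgroup.event_control");
        return 1;
    }

    /* Blocks until the kernel signals an OOM event in this cgroup. */
    if (read(efd, &count, sizeof(count)) == sizeof(count))
        printf("OOM event(s) in %s: %llu\n", cg, (unsigned long long)count);

    return 0;
}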
If someone wants to do the work, that would be great :). I have no plans
to do so myself for the time being.
Ryan
On 03/17/2017 08:46 AM, Sam Gallop (NBI) wrote:
Hi,
I believe you can get that message ('Exceeded job memory limit at some
point') even if the job finishes fine. When the cgroup is created (by
SLURM) it sets memory.limit_in_bytes to the memory request specified
for the job. During the life of the job the kernel updates a number of
files within the cgroup, one of which is memory.usage_in_bytes - the
current memory usage of the cgroup. Periodically, SLURM checks whether
the cgroup has exceeded its limit (i.e. memory.limit_in_bytes) - the
frequency of the check is probably set by JobAcctGatherFrequency. It
does this by checking whether memory.failcnt is greater than zero.
memory.failcnt is incremented by the kernel each time
memory.usage_in_bytes hits the value set in memory.limit_in_bytes.
This is the code snippet that produces the error (found in
task_cgroup_memory.c) …
extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
{
        ...
        else if (failcnt_non_zero(&step_memory_cg,
                                  "memory.failcnt"))
                /* reports the number of times that the
                 * memory limit has reached the value set
                 * in memory.limit_in_bytes.
                 */
                error("Exceeded step memory limit at some point.");
        ...
        else if (failcnt_non_zero(&job_memory_cg,
                                  "memory.failcnt"))
                error("Exceeded job memory limit at some point.");
        ...
}
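(failcnt_non_zero() itself isn't shown above; conceptually it just reads
memory.failcnt from the cgroup directory and checks whether it is
non-zero. A standalone sketch of that idea - not Slurm's actual
implementation, and with an example cgroup path - would be:)

/* Illustration only: read memory.failcnt and report whether it is
 * non-zero. The cgroup path passed in below is an example. */
#include <stdio.h>

static int failcnt_non_zero_sketch(const char *cgroup_dir)
{
    char path[256];
    unsigned long long failcnt = 0;
    FILE *fp;

    snprintf(path, sizeof(path), "%s/memory.failcnt", cgroup_dir);
    fp = fopen(path, "r");
    if (!fp)
        return 0;                 /* can't read -> assume no failures */
    if (fscanf(fp, "%llu", &failcnt) != 1)
        failcnt = 0;
    fclose(fp);
    return failcnt > 0;
}

int main(void)
{
    if (failcnt_non_zero_sketch("/sys/fs/cgroup/memory/slurm/uid_1000/job_42"))
        printf("memory limit was hit at least once\n");
    return 0;
}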
Anyway, back to the point. You can see this message and the job not
fail because the counter that SLURM checks (memory.failcnt) doesn't
actually mean the memory limit has been exceeded; it means the memory
limit has been reached - a subtle but important difference. It matters
because the OOM killer doesn't terminate a job for merely reaching its
memory limit, only for exceeding it, so the job isn't terminated.
Note: other cgroup files such as memory.memsw.xxx also come into play
if you are using swap space.
As to how to manage this: you can avoid cgroup and use an alternative
plugin, you can try the JobAcctGatherParams parameter NoOverMemoryKill
(the documentation says to use this with caution, see
https://slurm.schedmd.com/slurm.conf.html), or you can try to account
for the cache by using jobacct_gather/cgroup. Unfortunately, because of
a bug this plugin doesn't report cache usage correctly either. I've
contributed a bug report and fix to address this
(https://bugs.schedmd.com/show_bug.cgi?id=3531).
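(If you go the NoOverMemoryKill route, the slurm.conf side would look
something like the fragment below - an example only, so check the
slurm.conf man page for your version first.)

# Example only - verify against the slurm.conf documentation for your version
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
JobAcctGatherParams=NoOverMemoryKill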
---
Samuel Gallop
Computing infrastructure for Science
CiS Support & Development
From: Wensheng Deng [mailto:w...@nyu.edu]
Sent: 17 March 2017 13:42
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Slurm & CGROUP
The file is copied fine. It is just the error message that is annoying.
On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist
<janne.blomqv...@aalto.fi> wrote:
On 2017-03-15 17:52, Wensheng Deng wrote:
> No, it does not help:
>
> $ scontrol show config |grep -i jobacct
>
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/cgroup
> JobAcctGatherParams = NoShared
>
> On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng <w...@nyu.edu> wrote:
>
> I think I tried that. Let me try it again. Thank you!
>
> On Wed, Mar 15, 2017 at 11:43 AM, Chris Read <cr...@drw.com> wrote:
>
>
> We explicitly exclude shared usage from our measurement:
>
>
> JobAcctGatherType=jobacct_gather/cgroup
> JobAcctGatherParams=NoShared
>
> Chris
>
>
> ________________________________
> From: Wensheng Deng <w...@nyu.edu>
> Sent: 15 March 2017 10:28
> To: slurm-dev
> Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
>
> It should be (sorry):
> we 'cp'ed a 5GB file from scratch to node local disk
>
>
> On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu> wrote:
> Hello experts:
>
> We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a
> 5GB job from scratch to node local disk, declared 5 GB memory
> for the job, and saw error message as below although the file
> was copied okay:
>
> slurmstepd: error: Exceeded job memory limit at some point.
>
> srun: error: [nodenameXXX]: task 0: Out Of Memory
>
> srun: Terminating job step 41.0
>
> slurmstepd: error: Exceeded job memory limit at some point.
>
>
> From the cgroup document
> https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
> Features:
> - accounting anonymous pages, file caches, swap caches usage and
> limiting them.
>
> It seems that cgroup charges "RSS + file caches" to user
> processes like 'cp' - in our case, to users' jobs. Swap is
> off in this case. The file cache can be small or very big, and
> in my opinion it should not be charged to users' batch jobs.
> How do other sites circumvent this issue? The Slurm version is
> 16.05.4.
>
> Thank you and Best Regards.
>
>
>
>
Could you set AllowedRAMSpace/AllowedSwapSpace in
/etc/slurm/cgroup.conf to some big number? That way the job memory
limit becomes the cgroup soft limit, while the cgroup hard limit -
the point at which the kernel will OOM-kill the job - would be
"job_memory_limit * AllowedRAMSpace", i.e. some large value.
--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University