usage_in_bytes is not actually usage in bytes, by the way. It's often
close but I have seen wildly different values. See
https://lkml.org/lkml/2011/3/28/93 and
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section
5.5. memory.stat is what you want for accurate data.
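(For anyone who wants to poke at this: below is a minimal, untested sketch of what I mean - parse total_rss out of memory.stat rather than trusting memory.usage_in_bytes. The cgroup path is only an example; adjust it for your own hierarchy.)

/* Sketch only, not Slurm code: read total_rss from memory.stat
 * instead of memory.usage_in_bytes. Path below is an example. */
#include <stdio.h>
#include <string.h>

static long long read_total_rss(const char *stat_path)
{
    FILE *fp = fopen(stat_path, "r");
    char key[64];
    long long val, rss = -1;

    if (!fp)
        return -1;
    while (fscanf(fp, "%63s %lld", key, &val) == 2) {
        if (strcmp(key, "total_rss") == 0) {
            rss = val;
            break;
        }
    }
    fclose(fp);
    return rss;
}

int main(void)
{
    long long rss = read_total_rss(
        "/sys/fs/cgroup/memory/slurm/memory.stat");  /* example path */
    if (rss >= 0)
        printf("total_rss: %lld bytes\n", rss);
    return 0;
}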
I wrote the code you referenced below. Now that I know more about
failcnt, it does have some corner cases that aren't ideal. If I were to
start over I would use cgroup.event_control to get OOM events, such as
in
https://github.com/BYUHPC/uft/blob/master/oom_notifierd/oom_notifierd.c
or https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt section
9. At the time I didn't really feel like learning how to add and clean
up a thread or something similar that would listen for those events.
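(For reference, the event_control mechanism from section 9 of that
document boils down to something like the sketch below - example paths,
minimal error handling, and not the code from oom_notifierd.c, just the
same idea.)

/* Sketch of the cgroup-v1 OOM notification mechanism from memory.txt
 * section 9. Paths are examples only; error handling is minimal. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
    const char *cg = "/sys/fs/cgroup/memory/slurm/uid_1000/job_42"; /* example */
    char path[256], line[64];
    int efd, ofd, ecfd;
    uint64_t count;

    efd = eventfd(0, 0);                        /* event counter fd */

    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    ofd = open(path, O_RDONLY);                 /* fd to watch      */

    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    ecfd = open(path, O_WRONLY);

    /* Register the pair: "<eventfd> <oom_control fd>" */
    snprintf(line, sizeof(line), "%d %d", efd, ofd);
    if (write(ecfd, line, strlen(line)) < 0) {
        perror("cgroup.event_control");
        return 1;
    }

    /* Blocks until the kernel signals an OOM event in this cgroup. */
    if (read(efd, &count, sizeof(count)) == sizeof(count))
        printf("OOM event(s) in %s: %llu\n", cg, (unsigned long long)count);

    return 0;
}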
If someone wants to do the work, that would be great :). I have no plans
to do so myself for the time being.
Ryan
On 03/17/2017 08:46 AM, Sam Gallop (NBI) wrote:
Hi,
I believe you can get that message ('Exceeded job memory limit at some
point') even if the job finishes fine. When the cgroup is created (by
SLURM) it sets memory.limit_in_bytes to the memory request specified
for the job. During the life of the job the kernel updates a number of
files within the cgroup, one of which is memory.usage_in_bytes - the
current memory usage of the cgroup. Periodically, SLURM checks whether
the cgroup has exceeded its limit (i.e. memory.limit_in_bytes) - the
frequency of the check is probably set by JobAcctGatherFrequency. It
does this by checking whether memory.failcnt is greater than zero.
memory.failcnt is incremented by the kernel each time
memory.usage_in_bytes hits the value set in memory.limit_in_bytes.
This is the code snippet that produces the error (found in
task_cgroup_memory.c) …
extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
{
        ...
        else if (failcnt_non_zero(&step_memory_cg,
                                  "memory.failcnt"))
                /* reports the number of times that the
                 * memory limit has reached the value set
                 * in memory.limit_in_bytes.
                 */
                error("Exceeded step memory limit at some point.");
        ...
        else if (failcnt_non_zero(&job_memory_cg,
                                  "memory.failcnt"))
                error("Exceeded job memory limit at some point.");
        ...
}
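(failcnt_non_zero() itself isn't shown above; conceptually it just reads
memory.failcnt from the cgroup directory and checks whether it is
non-zero. A standalone sketch of that idea - not Slurm's actual
implementation, and with an example cgroup path - would be:)

/* Illustration only: read memory.failcnt and report whether it is
 * non-zero. The cgroup path passed in below is an example. */
#include <stdio.h>

static int failcnt_non_zero_sketch(const char *cgroup_dir)
{
    char path[256];
    unsigned long long failcnt = 0;
    FILE *fp;

    snprintf(path, sizeof(path), "%s/memory.failcnt", cgroup_dir);
    fp = fopen(path, "r");
    if (!fp)
        return 0;                 /* can't read -> assume no failures */
    if (fscanf(fp, "%llu", &failcnt) != 1)
        failcnt = 0;
    fclose(fp);
    return failcnt > 0;
}

int main(void)
{
    if (failcnt_non_zero_sketch("/sys/fs/cgroup/memory/slurm/uid_1000/job_42"))
        printf("memory limit was hit at least once\n");
    return 0;
}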
Anyway, back to the point. You can see this message and the job not
fail because the counter that SLURM checks (memory.failcnt) doesn't
actually mean the memory limit has been exceeded; it means the memory
limit has been reached - a subtle but important difference. It matters
because the OOM killer doesn't terminate a job for merely reaching its
memory limit, only for exceeding it, so the job isn't terminated.
Note: other cgroup files such as memory.memsw.xxx also come into play
if you are using swap space.
As to how to manage this: you can avoid cgroup and use an alternative
plugin, you can try the JobAcctGatherParams parameter NoOverMemoryKill
(the documentation says to use this with caution, see
https://slurm.schedmd.com/slurm.conf.html), or you can try to account
for the cache by using jobacct_gather/cgroup. Unfortunately, because of
a bug this plugin doesn't report cache usage correctly either. I've
contributed a bug report and fix to address this
(https://bugs.schedmd.com/show_bug.cgi?id=3531).
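(If you go the NoOverMemoryKill route, the slurm.conf side would look
something like the fragment below - an example only, so check the
slurm.conf man page for your version first.)

# Example only - verify against the slurm.conf documentation for your version
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
JobAcctGatherParams=NoOverMemoryKill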
---
Samuel Gallop
Computing infrastructure for Science
CiS Support & Development
From: Wensheng Deng [mailto:w...@nyu.edu]
Sent: 17 March 2017 13:42
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Slurm & CGROUP
The file is copied fine. It is just the error message that is annoying.
On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist
<janne.blomqv...@aalto.fi> wrote:
On 2017-03-15 17:52, Wensheng Deng wrote:
> No, it does not help:
>
> $ scontrol show config |grep -i jobacct
>
> JobAcctGatherFrequency = 30
> JobAcctGatherType = jobacct_gather/cgroup
> JobAcctGatherParams = NoShared
>
> On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng <w...@nyu.edu> wrote:
>
> I think I tried that. Let me try it again. Thank you!
>
> On Wed, Mar 15, 2017 at 11:43 AM, Chris Read <cr...@drw.com> wrote:
>
>
> We explicitly exclude shared usage from our measurement:
>
>
> JobAcctGatherType=jobacct_gather/cgroup
> JobAcctGatherParams=NoShared
>
> Chris
>
>
> ________________________________
> From: Wensheng Deng <w...@nyu.edu>
> Sent: 15 March 2017 10:28
> To: slurm-dev
> Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
>
> It should be (sorry):
> we 'cp'ed a 5GB file from scratch to node local disk
>
>
> On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu> wrote:
> Hello experts:
>
> We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a
> 5GB job from scratch to node local disk, declared 5 GB memory
> for the job, and saw error message as below although the file
> was copied okay:
>
> slurmstepd: error: Exceeded job memory limit at some point.
>
> srun: error: [nodenameXXX]: task 0: Out Of Memory
>
> srun: Terminating job step 41.0
>
> slurmstepd: error: Exceeded job memory limit at some point.
>
>
> From the cgroup document
> https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
> Features:
> - accounting anonymous pages, file caches, swap caches usage and
> limiting them.
>
> It seems that cgroup charges "RSS + file caches" to user
> processes like 'cp' - in our case, to users' jobs. Swap is
> off in this case. The file cache can be small or very big, and
> in my opinion it should not be charged to users' batch jobs.
> How do other sites circumvent this issue? The Slurm version is
> 16.05.4.
>
> Thank you and Best Regards.
>
>
>
>
Could you set AllowedRAMSpace/AllowedSwapSpace in
/etc/slurm/cgroup.conf to some big number? That way the job memory
limit becomes the cgroup soft limit, while the cgroup hard limit -
the point at which the kernel will OOM-kill the job - would be
"job_memory_limit * AllowedRAMSpace", i.e. some large value.
--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University