Thanks Reuti, see below

> On 25.08.2014 at 22:27, Michael Stauffer <mgsta...@gmail.com> wrote:
> >
> >
> > Version OGS/GE 2011.11p1 (Rocks 6.1)
> >
> > Hi,
> >
> > I'm using h_vmem and s_vmem to limit memory usage for qsub and qlogin
> jobs. A user's got some analyses running on nearly identical data sets that
> are hitting memory limits and being killed, which is fine, but the messages
> are inconsistent. Some instances report an exception from the app in
> question saying that memory can't be allocated. This app (an in-house tool)
> sends exceptions to stdout. Other instances just dump core and there's no
> message about memory problems in either stdout or stderr logs.
> >
> > h_vmem is 6000M and s_vmem is 5900M. It might be that the instances are
> right up against the s_vmem limit when the failing memory allocation
> occurs, and in some cases the requested amount triggers only the soft
> limit, and in others it triggers both. So perhaps the instances where it
> triggers the hard limit are the ones without the exception messages?
> Unfortunately the stderr and stdout log filenames don't contain job ids.
>
> But you can include the job id in the filename of the generated
> stdout/-err file, or dump a `ps -e f` to stdout in the jobscript. The
> shepherd process will also have the job id as an argument.
>

Yes sorry, I wasn't clear. I just meant that the output files I had to work
with from the user did not have the job ids included. In further tests, I
can include the job id.
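
For the next round I'll put the job id into the log filenames, roughly like
this (a minimal sketch; qsub accepts the $JOB_ID and $JOB_NAME pseudo
variables in the -o/-e paths, and the log directory and script names here
are just placeholders):

    #!/bin/bash
    #$ -S /bin/bash
    # Name the stdout/stderr logs after the job so failing iterations
    # can be matched to their job ids later.
    #$ -o logs/$JOB_NAME.o$JOB_ID
    #$ -e logs/$JOB_NAME.e$JOB_ID

    # Also dump the node's process tree, as suggested; the shepherd
    # line there shows the job id as an argument.
    ps -e f > "logs/pstree.$JOB_ID.txt"

    ./run_analysis.sh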


> Do you catch the sigxcpu in the job script?
>

No. Is this relevant for h_vmem and s_vmem limits?
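
If it is, I can try catching it in the next round of test scripts, roughly
like this (a minimal sketch using bash's trap builtin; the command name is
a placeholder):

    #!/bin/bash
    #$ -S /bin/bash

    # Log the signal if it arrives, so the stdout/stderr logs show
    # whether the job ever saw SIGXCPU before dying.
    trap 'echo "caught SIGXCPU at $(date)" >&2' XCPU

    ./memory_hungry_command
    echo "script reached the end"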


> When the loglevel in SGE is set to log_info, it will also record the
> passed limits in the messages file of the execd on the node. This is
> another place to look at then.


Great. qconf shows the loglevel is currently log_warning, yet I still see
messages in the execd messages file about s_vmem and h_vmem being caught,
which is very helpful.
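
For reference, this is roughly how I'm checking (the spool directory path
below is only an example; it depends on the local installation):

    # Show the current loglevel from the cluster configuration
    qconf -sconf | grep loglevel

    # Look for limit-related entries in the execd messages file on the
    # execution node (example path; adjust for the local spool layout)
    grep -i vmem /opt/gridengine/default/spool/compute-0-1/messages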

I've run some more tests with both a modified analysis script and a simple
bash script that eats memory. Each script prints a few lines to stdout after
the command that runs out of memory, so I can tell whether the script keeps
running once the memory limit is hit. I ran 100 iterations of each script:
one run with h_vmem set higher than s_vmem, and another run with the limits
reversed. In both cases, about 90% of the iterations exit 'cleanly': the
offending command prints an exception saying memory could not be allocated,
and the rest of the script still runs. In the remaining cases, the output
shows neither a memory exception message nor any sign that the script kept
running. Does this seem normal?
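
For reference, the memory-eating bash script looks roughly like this (a
simplified sketch; the chunk size and iteration count are arbitrary, and
the real script prints more progress markers):

    #!/bin/bash
    #$ -S /bin/bash

    echo "starting memory test, job $JOB_ID"

    # Eat memory in a child process so the parent script can report
    # whether it keeps running after the child hits the limit.
    bash -c '
      chunk=$(head -c 10000000 /dev/zero | tr "\0" "x")   # ~10 MB of "x"
      data=""
      for i in $(seq 1 1000); do    # ~10 GB total, well past the limits
        data+=$chunk
      done
    '
    echo "memory eater exited with status $?"

    # This line only appears if the script keeps running after the limit.
    echo "script reached the end"

I submit it with the limits mentioned above, e.g. qsub -l
h_vmem=6000M,s_vmem=5900M for one run, and with the two values swapped for
the other.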

-M


> > However, in my first tests anyway, a qsub script that runs out of memory
> shows an exception message, even when s_vmem is higher than h_vmem. So I'm
> not sure about this line of reasoning.
> >
> > We're trying to figure it out and will run more tests, but I thought I'd
> check here first to see if anyone's had this kind of experience. Thanks.
> >
> > -M
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
