>
> > > Version OGS/GE 2011.11p1 (Rocks 6.1)
> > >
> > > Hi,
> > >
> > > I'm using h_vmem and s_vmem to limit memory usage for qsub and
> > > qlogin jobs. A user's got some analyses running on nearly identical
> > > data sets that are hitting memory limits and being killed, which is
> > > fine, but the messages are inconsistent. Some instances report an
> > > exception from the app in question saying that memory can't be
> > > allocated. This app (an in-house tool) sends exceptions to stdout.
> > > Other instances just dump core and there's no message about memory
> > > problems in either stdout or stderr logs.
> > >
> > > h_vmem is 6000M and s_vmem is 5900M. It might be that the instances
> > > are right up against the s_vmem limit when the failing memory
> > > allocation occurs, and in some cases the requested amount triggers
> > > only the soft limit, while in others it triggers both. So perhaps
> > > the instances where it triggers the hard limit are the ones without
> > > the exception messages? Unfortunately, the stderr and stdout log
> > > filenames don't contain job ids.
> >
> > But you can include the job id in the filename of the generated
> > stdout/stderr files, or dump a `ps -e f` to stdout in the job script.
> > The shepherd process will also have the job id as an argument.
> >
> > Yes, sorry, I wasn't clear. I just meant that the output files I had
> > to work with from the user did not have the job ids included. In
> > further tests I can include the job id.
> >
> > Do you catch SIGXCPU in the job script?
> >
> > No. Is this relevant for h_vmem and s_vmem limits?
>
> Passing the s_vmem limit will send a signal (SIGXCPU) to the job. If
> you are not acting on it, it will either abort the job (the default
> action for SIGXCPU) or, if the signal is ignored, later pass the
> h_vmem limit as well. The action taken for these limits is described
> in `man queue_conf`. What behavior did you expect when s_vmem is
> passed?
>

Seems I hadn't thought it through enough. I'd figured the signal would
be caught by the application run via the job script, and if the
application handles it, it would quit gracefully. But what you're
saying is that the signal goes to the job script process? I guess that
makes sense.

I'm experimenting with 'trap' to catch SIGXCPU, but so far it never
traps the signal. I've tried having the script hit both the vmem and
CPU-time ulimits, in a simple script run both on the frontend and via
qsub. In the same script, if I trap SIGINT instead, the trap works and
my trap command runs. Has anyone run into this before?
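
For reference, here's roughly what I'm testing with (just a sketch; the
limit value and the messages are arbitrary):

    #!/bin/bash
    # Trap test sketch: lower only the *soft* CPU-time limit so the kernel
    # sends SIGXCPU rather than killing the shell at a hard limit, and burn
    # CPU in the shell itself so the signal goes to the process that
    # installed the trap.
    caught=0
    trap 'caught=1; echo "caught SIGXCPU" >&2' XCPU

    ulimit -S -t 5                          # 5-second soft CPU limit (example)

    while [ "$caught" -eq 0 ]; do :; done   # busy loop until the limit is hit

    echo "script continued running after the limit"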

And if I can get SIGXCPU to work in a trap, is there a way to add a
trap command to every script that gets submitted via qsub, or to run
each qsub command through a wrapper script that includes the trap?
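
Something along these lines is what I have in mind (purely a sketch;
the wrapper name, the trap action, and the submit line below are made
up):

    #!/bin/bash
    # qsub_trap_wrapper.sh (hypothetical): install the trap here, then run
    # the user's real job script, passed as the wrapper's first argument.
    #$ -S /bin/bash

    trap 'echo "SIGXCPU caught in job $JOB_ID (soft limit passed)" >&2' XCPU

    # 'source' keeps the job script in this shell so the trap stays in
    # effect; running it as a separate child process would leave the trap
    # in the wrapper only.
    source "$1"

and submitted along the lines of:

    qsub -l s_vmem=5900M,h_vmem=6000M qsub_trap_wrapper.sh real_analysis.sh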


>
>
> > When the loglevel in SGE is set to log_info, it will also record the
> > passed limits in the messages file of the execd on the node. This is
> > another place to look at then.
> >
> > Great. qconf shows the level is currently log_warning, yet I still
> > see messages about catching s_vmem and h_vmem, which is very helpful.
> >
>

However, now that I test this more, sometimes I do NOT see messages on
a node about a job that was just terminated due to memory limits. Would
this be because of the situation you describe below, where the kernel
acts before SGE to kill the job?
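
For what it's worth, I'm checking on the node like this (assuming the
usual local spool layout; the actual path may differ on our install):

    # on the execution node, using the job id reported by qstat/qacct
    grep <job_id> $SGE_ROOT/$SGE_CELL/spool/<hostname>/messages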


> > I've run some more tests with both a modified analysis script and a
> > simple bash script that eats memory. Each script has some commands
> > to print to stdout that run after the command that runs out of
> > memory, so I can monitor if the script keeps running after the mem
> > limit is reached. I ran 100 iterations of each of these, one run
> > with h_vmem set higher than s_vmem, and the other run vice versa. In
> > both cases, I get about 90% of the iterations with a 'clean' exit,
> > in which I see an exception message from the offending command that
> > memory could not be allocated, and the script finishes running after
> > the offending command. In the remaining cases, the output shows
> > neither a memory exception message nor that the script finishes
> > running. Does this seem normal?
>
> AFAIK yes. The overall h_vmem consumption is observed by SGE, and it
> will act when the limit is passed. But h_vmem also sets a kernel
> limit. Whichever of the two notices it first will take action. The
> difference is that SGE accumulates all processes belonging to a job,
> while the kernel limit is per process.
>

OK, so you're saying when SGE acts on h_vmem, I'm getting a clean exit, but
when the kernel catches it first, I'm getting the 'no message' exit?

Thanks.

-M


>
> -- Reuti
>
>
> > -M
> >
> >
> > > However, in my first tests anyway, a qsub script that runs out of
> > > memory shows an exception message, even when s_vmem is higher than
> > > h_vmem. So I'm not sure about this line of reasoning.
> > >
> > > We're trying to figure it out and will run more tests, but I
> > > thought I'd check here first to see if anyone's had this kind of
> > > experience. Thanks.
> > >
> > > -M
> >
>
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
