Hi,

On 27.08.2014 at 18:58, Michael Stauffer wrote:

> Thanks Reuti, see below
> 
> > On 25.08.2014 at 22:27, Michael Stauffer <mgsta...@gmail.com> wrote:
> >
> >
> > Version OGS/GE 2011.11p1 (Rocks 6.1)
> >
> > Hi,
> >
> > I'm using h_vmem and s_vmem to limit memory usage for qsub and qlogin jobs. 
> > A user has some analyses, running on nearly identical data sets, that are 
> > hitting memory limits and being killed, which is fine, but the messages are 
> > inconsistent. Some instances report an exception from the app in question 
> > saying that memory can't be allocated. This app (an in-house tool) sends 
> > exceptions to stdout. Other instances just dump core and there's no message 
> > about memory problems in either stdout or stderr logs.
> >
> > h_vmem is 6000M and s_vmem is 5900M. It might be that the instances are 
> > right up against the s_vmem limit when the failing memory allocation 
> > occurs, and in some cases the requested amount triggers only the soft 
> > limit, while in others it triggers both. So perhaps the instances where it 
> > triggers the hard limit are the ones without the exception messages? 
> > Unfortunately the stderr and stdout log filenames don't contain job ids.
> 
> But you can include the job id in the filename of the generated stdout/stderr 
> file, or dump a `ps -e f` to stdout in the job script. The shepherd process 
> will also have the job id as an argument on its command line.
> 
> Yes, sorry, I wasn't clear. I just meant that the output files I had to work 
> with from the user did not have the job ids included. In further tests, I 
> can include the job id.
>  
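As an aside, the job id can also be put straight into the log file names at 
submission time. A minimal sketch of such a job script (assuming a plain qsub 
job; `$JOB_ID` is one of the pseudo variables SGE expands in the -o/-e paths, 
and `./my_analysis` is a hypothetical placeholder for the real application):

    #!/bin/bash
    #$ -S /bin/bash
    # name the logs after the job id so a failing run can be matched to its job
    #$ -o run.$JOB_ID.out
    #$ -e run.$JOB_ID.err

    echo "job $JOB_ID running on $(hostname)"
    ps -e f              # snapshot of the process tree, lands in the stdout log
    ./my_analysis        # hypothetical placeholder for the real application
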
> Do you catch the sigxcpu in the job script?
> 
> No. Is this relevant for h_vmem and s_vmem limits?

Passing the s_vmem limit sends SIGXCPU to the job. If you are not acting on it, 
the default action for SIGXCPU aborts the job; if the signal is ignored, the job 
is killed later when it passes h_vmem. The behavior for these limits is described 
in `man queue_conf`. What behavior did you expect when s_vmem is passed?
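If you want to act on it, a trap in the job script is enough. A minimal sketch 
(assuming a bash job script; `./my_analysis` is a hypothetical placeholder, and 
note the child also receives SIGXCPU, so it may still die before the trap runs):

    #!/bin/bash
    #$ -S /bin/bash
    #$ -l s_vmem=5900M,h_vmem=6000M

    # log and exit cleanly when the soft limit is passed,
    # instead of waiting for h_vmem to kill the job
    trap 'echo "SIGXCPU caught: s_vmem exceeded" >&2; exit 1' XCPU

    ./my_analysis        # hypothetical placeholder for the real application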


> When the loglevel in SGE is set to log_info, it will also record the passed 
> limits in the messages file of the execd on the node. This is another place 
> to look, then.
> 
> Great. qconf shows the level is currently log_warning, yet I still see 
> messages about catching s_vmem and h_vmem, which is very helpful.
> 
> I've run some more tests with both a modified analysis script and a simple 
> bash script that eats memory. Each script has some commands to print to 
> stdout that run after the command that runs out of memory, so I can monitor 
> if the script keeps running after the mem limit is reached. I ran 100 
> iterations of each of these, one run with h_vmem set higher than s_vmem, and 
> the other run vice versa. In both cases, I get about 90% of the iterations 
> with a 'clean' exit, in which I see an exception message from the offending 
> command that memory could not be allocated, and the script finishes running 
> after the offending command. In the remaining cases, the output shows neither 
> a memory exception message nor that the script finishes running. Does this 
> seem normal?

AFAIK yes. SGE observes the overall h_vmem consumption of the job and acts when 
it's passed. But h_vmem also sets a kernel limit. Whichever of the two notices it 
first will take action. The difference is that SGE accumulates the usage of all 
processes belonging to a job, while the kernel limit is per process.
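To see both sides from inside a job, something like the lines below can go into 
the job script (a sketch; on Linux the h_vmem/s_vmem values are typically applied 
as the per-process address-space rlimit, which `ulimit -v` reports in KB, while 
`qstat -j` shows the usage SGE has summed over the whole job):

    # kernel-enforced per-process limits, as derived from h_vmem / s_vmem
    ulimit -H -v         # hard virtual-memory limit in KB
    ulimit -S -v         # soft virtual-memory limit in KB

    # SGE-side accounting, summed over all processes of the job
    qstat -j $JOB_ID | grep usage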

-- Reuti


> -M
> 
> 
> > However, in my first tests anyway, a qsub script that runs out of memory 
> > shows an exception message, even when s_vmem is higher than h_vmem. So I'm 
> > not sure about this line of reasoning.
> >
> > We're trying to figure it out and will run more tests, but I thought I'd 
> > check here first to see if anyone's had this kind of experience. Thanks.
> >
> > -M
> 


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
