Hi,

On 27.08.2014 at 18:58, Michael Stauffer wrote:
> Thanks Reuti, see below
>
> > On 25.08.2014 at 22:27, Michael Stauffer <mgsta...@gmail.com> wrote:
> >
> > > Version OGS/GE 2011.11p1 (Rocks 6.1)
> > >
> > > Hi,
> > >
> > > I'm using h_vmem and s_vmem to limit memory usage for qsub and qlogin jobs. A user's got some analyses running on nearly identical data sets that are hitting memory limits and being killed, which is fine, but the messages are inconsistent. Some instances report an exception from the app in question saying that memory can't be allocated. This app (an in-house tool) sends exceptions to stdout. Other instances just dump core and there's no message about memory problems in either the stdout or stderr logs.
> > >
> > > h_vmem is 6000M and s_vmem is 5900M. It might be that the instances are right up against the s_vmem limit when the failing memory allocation occurs, and in some cases the requested amount triggers only the soft limit, and in others it triggers both. So perhaps the instances where it triggers the hard limit are the ones without the exception messages? Unfortunately the stderr and stdout log filenames don't contain job ids.
> >
> > But you can include the job id in the filename of the generated stdout/stderr files, or dump a `ps -e f` to stdout in the jobscript. The shepherd process will also show the job id as an argument.
>
> Yes sorry, I wasn't clear. I just meant that the output files I had to work with from the user did not have the job ids included. In further tests, I can include the job id.
>
> > Do you catch the sigxcpu in the job script?
>
> No. Is this relevant for h_vmem and s_vmem limits?

Passing the s_vmem limit will send a signal to the job. If you are not acting on it, it will either abort the job (the default action for SIGXCPU), or, if the signal is ignored, the job will later pass the h_vmem limit. The action for these limits is described in `man queue_conf`. What behavior did you expect when s_vmem is passed?

> > When the loglevel in SGE is set to log_info, it will also record the passed limits in the messages file of the execd on the node. This is another place to look at then.
>
> Great. qconf shows the level is currently log_warning, yet I still see messages about catching s_vmem and h_vmem, which is very helpful.
>
> I've run some more tests with both a modified analysis script and a simple bash script that eats memory. Each script has some commands that print to stdout after the command that runs out of memory, so I can monitor whether the script keeps running after the memory limit is reached. I ran 100 iterations of each of these, one run with h_vmem set higher than s_vmem, and the other run vice versa. In both cases, I get about 90% of the iterations with a 'clean' exit, in which I see an exception message from the offending command that memory could not be allocated, and the script finishes running after the offending command. In the remaining cases, the output shows neither a memory exception message nor that the script finishes running. Does this seem normal?

AFAIK yes. The overall h_vmem consumption is observed by SGE, and it will act when the limit is passed. But h_vmem also sets a kernel limit. Whichever of these two notices it first will take action. The difference is that SGE accumulates all processes belonging to a job, while the kernel limit is per process.
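If it helps for the next round of tests, something like the jobscript below should do it (an untested sketch; "memtest" and "your_memory_hungry_command" are just placeholders for your own filenames and the in-house tool):

    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    # Put the job id into the output/error filenames; $JOB_ID and $JOB_NAME
    # are pseudo variables which SGE expands in the -o/-e paths:
    #$ -o memtest.$JOB_ID.out
    #$ -e memtest.$JOB_ID.err
    # Request the limits here or on the command line instead:
    #   qsub -l h_vmem=6000M,s_vmem=5900M memtest.sh
    #$ -l h_vmem=6000M
    #$ -l s_vmem=5900M

    # Record the job id and the process tree; the sge_shepherd entry in the
    # ps output carries the job id as well.
    echo "JOB_ID=$JOB_ID on $(hostname)"
    ps -e f

    # The kernel limit derived from h_vmem is per process; this should show
    # it (RLIMIT_AS on Linux, in kB):
    ulimit -v

    # Catch the SIGXCPU that is sent when s_vmem is passed, so the script
    # leaves a marker in the log instead of being aborted (the default
    # action). The tool itself will also receive the signal unless it
    # handles or ignores it.
    trap 'echo "caught SIGXCPU: s_vmem limit passed" >&2' XCPU

    your_memory_hungry_command
    echo "script continued after the memory-hungry command (exit code $?)"

With that in place you can match every core dump or exception message to a job id and to the process listing of the node it ran on.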
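And to double-check what the execd recorded for a particular run, the usual places are these (the spool path is only the typical default, adjust it for your installation; <node> and <job_id> are placeholders):

    # show the current loglevel in the cluster configuration
    qconf -sconf | grep loglevel

    # raise it to log_info for more detail (opens the global config in an editor)
    qconf -mconf

    # then search the execd messages file on the node for the job
    grep <job_id> $SGE_ROOT/$SGE_CELL/spool/<node>/messages

Since you already see the s_vmem/h_vmem messages at log_warning, the grep alone should be enough to match them to the job ids.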
-- Reuti

> -M
>
> > > However, in my first tests anyway, a qsub script that runs out of memory shows an exception message, even when s_vmem is higher than h_vmem. So I'm not sure about this line of reasoning.
> > >
> > > We're trying to figure it out and will run more tests, but I thought I'd check here first to see if anyone's had this kind of experience. Thanks.
> > >
> > > -M

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users