That's great. Could you attach the custom mailer definition so I could give it a try, too? Thanks,
Ondrej

> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Tuesday, March 21, 2017 3:54 PM
> To: Ondrej Valousek <ondrej.valou...@s3group.com>
> Cc: William Hay <w....@ucl.ac.uk>; sge-discuss@liv.ac.uk <sge-disc...@liverpool.ac.uk>
> Subject: Re: [SGE-discuss] Enforcing memory limit for job
>
> > On 21.03.2017 at 14:38, Ondrej Valousek <ondrej.valou...@s3group.com> wrote:
> >
> > Thanks.
> > Is there any way we could figure out why the job terminated? I.e. will qacct tell whether the job finished normally, was killed for exceeding its memory limit, was terminated because h_rt was exceeded, or similar?
>
> No. In qacct it is only recorded that the job was killed. The reason why it was aborted is only visible in the messages file of the node.
>
> To attach this to the email the user gets when a job aborts or ends, we have a custom mailer defined which attaches any information it finds in the messages file of the master node of the job. The email the user gets then looks like this, with the appended lines (we output some of the context variables too):
>
> ===
> Job 217627 (ever) Aborted
> Exit Status = 137
> Signal = KILL
> User = reuti
> Queue = common@node28
> Host = node28
> Start Time = 03/21/2017 15:50:22
> End Time = 03/21/2017 15:50:53
> CPU = 00:00:30
> Max vmem = 3.977M
> failed assumedly after job because:
> job 217627.1 died through signal KILL (9)
>
> Reason for job abort:
> 03/21/2017 15:50:53| main|node28|W|job 217627.1 exceeded hard wallclock time - initiate terminate method
>
> Used nodes: node28:1
> Issued command was: -m ea -l h_rt=30 -b y -ac "FOO=baz" ./ever
> ===
>
> Sure, the two lines of the messages file could even be written to a central file for the admin or stored in a database. In case a slave node passes this limit, we don't get this information right now. For us, many jobs run on a single node anyway.
>
> Would this help?
>
> -- Reuti
>
>
> > Ondrej
> >
> >> -----Original Message-----
> >> From: Reuti [mailto:re...@staff.uni-marburg.de]
> >> Sent: Tuesday, March 21, 2017 2:27 PM
> >> To: Ondrej Valousek <ondrej.valou...@s3group.com>
> >> Cc: William Hay <w....@ucl.ac.uk>; sge-discuss@liv.ac.uk <sge-disc...@liverpool.ac.uk>
> >> Subject: Re: [SGE-discuss] Enforcing memory limit for job
> >>
> >>
> >>> On 21.03.2017 at 14:12, Ondrej Valousek <ondrej.valou...@s3group.com> wrote:
> >>>
> >>> That's similar to the difference between limiting the job via control groups and via a shell limit.
> >>> With cgroups, the job is killed once it attempts to get more.
> >>> With a shell limit, functions like mmap() return with ENOMEM.
> >>>
> >>> I like the shell limit more because an application has a chance to survive ("phew, got ENOMEM, let's do things differently", like the command "less" for example).
> >>>
> >>> Anyway, you are saying that even with h_vmem set to "JOB", gridengine would kill the job before the shell limit takes effect, right?
> >>
> >> Correct. If submitted with -notify it might even get a warning before it gets killed, or a signal if s_vmem is exceeded. It's outlined in `man queue_conf`, section "RESOURCE LIMITS".
> >>
> >> -- Reuti
> >>
> >>
> >>>
> >>> Thanks,
> >>> Ondrej
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Reuti [mailto:re...@staff.uni-marburg.de]
> >>>> Sent: Tuesday, March 21, 2017 2:05 PM
> >>>> To: Ondrej Valousek <ondrej.valou...@s3group.com>
> >>>> Cc: William Hay <w....@ucl.ac.uk>; sge-discuss@liv.ac.uk <sge-disc...@liverpool.ac.uk>
> >>>> Subject: Re: [SGE-discuss] Enforcing memory limit for job
> >>>>
> >>>>
> >>>>> On 21.03.2017 at 13:06, Ondrej Valousek <ondrej.valou...@s3group.com> wrote:
> >>>>>
> >>>>> I was experimenting with h_vmem and my findings are:
> >>>>>
> >>>>> The memory consumed is multiplied by the number of slots the user requests if h_vmem is set to "YES". It is not multiplied if set to "JOB".
> >>>>> However, the job memory limit (as per "ulimit") is in both cases multiplied by the number of slots consumed, which is a bit confusing.
> >>>>>
> >>>>> So if we set h_vmem to "JOB" and 2Gb, then run "qrsh -pe ade 4", we obtain a shell with a 16Gb virtual memory limit, but only 2Gb is subtracted from the host memory.
> >>>>>
> >>>>> Is that behavior expected?
> >>>>
> >>>> It should not hurt the behavior. As SGE on its own is checking the overall h_vmem consumption by the additional group id, it should kill the job in case it exceeds the (lower) limit. It would just be a matter of: who notices the violation first?
> >>>>
> >>>> -- Reuti
> >>>>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Ondrej
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: William Hay [mailto:w....@ucl.ac.uk]
> >>>>>> Sent: Tuesday, March 14, 2017 4:18 PM
> >>>>>> To: Ondrej Valousek <ondrej.valou...@s3group.com>
> >>>>>> Cc: sge-discuss@liv.ac.uk <sge-disc...@liverpool.ac.uk>
> >>>>>> Subject: Re: [SGE-discuss] Enforcing memory limit for job
> >>>>>>
> >>>>>> On Fri, Mar 10, 2017 at 11:46:26AM +0000, Ondrej Valousek wrote:
> >>>>>>> Hi List,
> >>>>>>>
> >>>>>>> I need help with setting default limits for jobs.
> >>>>>>> I would need something that would limit job memory consumption to, say, 20Gb but was not consumable unless explicitly specified by a user.
> >>>>>>>
> >>>>>>> I thought setting the h_vmem attribute in the GE complex configuration would do the trick, but it is consumable, so launching a few little terminal jobs into the farm would soon fill resources everywhere.
> >>>>>>>
> >>>>>>> Is there some attribute like this?
> >>>>>>> Thanks,
> >>>>>> You can change h_vmem to be non-consumable if you like.
> >>>>>>
> >>>>>> If you want it consumable for those who request it but not for those who don't, you may be able to exploit the difference between configuring h_vmem as a queue resource_limit and configuring it under complex_values.
> >>>>>> As a queue_limit it should apply whether you request it or not.
> >>>>>> Under complex_values it should be requestable.
> >>>>>>
> >>>>>> (NB: never tried this).
> >>>>>>
> >>>>>> William
> >>>>> -----
> >>>>>
> >>>>> The information contained in this e-mail and in any attachments is confidential and is designated solely for the attention of the intended recipient(s). If you are not an intended recipient, you must not use, disclose, copy, distribute or retain this e-mail or any part thereof.
> >>>>> If you have received this e-mail in error, please notify the sender by return e-mail and delete all copies of this e-mail from your computer system(s). Please direct any additional queries to: communicati...@s3group.com. Thank You. Silicon and Software Systems Limited (S3 Group). Registered in Ireland no. 378073. Registered Office: South County Business Park, Leopardstown, Dublin 18.
> >>>>>
> >>>>> _______________________________________________
> >>>>> SGE-discuss mailing list
> >>>>> SGE-discuss@liv.ac.uk
> >>>>> https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
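
The mailer definition itself is not included in this thread. As a rough illustration of just one step Reuti describes (picking the lines that belong to a job out of the messages file of the job's master node), here is a minimal sketch. It assumes entries contain "job <id>.<task>" as in the line quoted in the sample email; the function name `job_messages` and the second sample line are made up for illustration, and this is not Reuti's actual mailer.

```python
import re
from typing import List


def job_messages(messages_text: str, job_id: int) -> List[str]:
    """Return the lines of a messages file that mention tasks of job_id.

    Assumes entries contain "job <id>.<task>", as in the line quoted in the
    thread; other message layouts are not handled.
    """
    pattern = re.compile(rf"\bjob {job_id}\.\d+\b")
    return [line for line in messages_text.splitlines() if pattern.search(line)]


# The first line is taken from the thread; the second is a made-up entry for
# a different job, included only to show that it is filtered out.
sample = (
    "03/21/2017 15:50:53| main|node28|W|job 217627.1 exceeded hard wallclock "
    "time - initiate terminate method\n"
    "03/21/2017 15:51:10| main|node28|W|job 217628.1 exceeded hard wallclock "
    "time - initiate terminate method\n"
)

for line in job_messages(sample, 217627):
    print(line)
```

A real mailer would have to read the messages file from the execution host and append the matching lines to the mail body; where that file lives depends on the installation.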