Nice,
Unfortunately it does not work.

I was wondering why my jobs always show “exit code 0” even when they 
obviously failed due to a lack of memory.
I thought it was a bug in GE…
…but it is not.

What is probably happening here is:

- We submit a job with memory constraints.

- The job runs in a constrained “setrlimit()” environment.

- The application needs some memory, so it calls, for example, malloc(1024*1024*1024).

- malloc() fails, returning NULL with errno set to ENOMEM.

- The application handles this gracefully, perhaps complains a little in its 
log file, and exits (it is NOT killed).

- SGE did not kill the job; it sees that the job ended by itself, so it reports exit code 0.
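The sequence above can be reproduced outside GridEngine with a small shell sketch (assuming a Linux machine with python3 available; `ulimit -v` is the shell front end to setrlimit(RLIMIT_AS)):

```shell
# Run a memory-hungry program under an address-space limit of ~512 MiB.
# The program handles the failed allocation itself and exits normally,
# so the parent shell -- just like SGE -- sees exit status 0.
(
  ulimit -v 524288          # setrlimit(RLIMIT_AS), ~512 MiB
  python3 - <<'EOF'
try:
    buf = bytearray(1024 * 1024 * 1024)   # ask for ~1 GiB
except MemoryError:
    print("allocation failed; exiting gracefully")
EOF
)
echo "exit status seen by the shell: $?"
```

Whether or not the allocation fails on a given platform, the subshell exits with status 0, which is exactly what SGE records for such a job.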

Does that make sense?
If yes, is it possible to configure GE not to run the job in a constrained 
environment (like the one we get with, for example, “qrsh -l h_vmem=1G”) but 
rather to kill it instantly once it grows too much?
That way, I believe, GE would show a more accurate job exit status in case 
the job was killed due to a lack of memory.

Many thanks,
Ondrej



From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Tuesday, March 21, 2017 6:30 PM
To: Ondrej Valousek <ondrej.valou...@s3group.com>
Cc: William Hay <w....@ucl.ac.uk>; sge-discuss@liv.ac.uk 
<sge-disc...@liverpool.ac.uk>
Subject: Customizing emails send from GridEngine (was: RE: [SGE-discuss] 
Enforcing memory limit for job)

I changed the subject, as this is a setup which might be advantageous for other 
users too.


The constraint one faces when changing or adding information to the email sent 
from the exechost to the user is that the email is sent after the job has 
already left the exechost and GridEngine. So there is nothing left which could 
be scanned for relevant information besides the messages file of the exechost. 
If this information is sufficient (and no access to the job's context is 
required), one can skip to item f) below. All paths need to be adjusted to 
reflect the structure of your cluster (also in the two supplied scripts).

====

a) Record the issued command in a job context:

To implement this feature we need two definitions in the Bash profile:

qsub-helper() { (export AC="`history 1 | perl -pe 's/^ *[0-9]+ +[^ ]+ ?//'`"; eval /usr/sge/bin/lx24-em64t/qsub -ac "COMMAND='qsub $AC'" "$AC") }
alias qsub='qsub-helper #'

(I got this hint on the Bash mailing list pointing to 
http://www.chiark.greenend.org.uk/~sgtatham/aliases.html)

In case you wonder about the lx24-em64t while you might need lx24-amd64 here: 
in the beginning we had a mixture of Intel and AMD machines, and to target the 
correct binary I adjusted the `arch` script to behave accordingly and used $ARC 
in the job scripts, while the directory trees below were congruent for each 
type of CPU.

====

b) Prepare to store the context of a job after it has finished:

On all exechosts we need a directory which is writable by the sgeadmin user 
(it could be in a shared directory too; I prefer it on each node):

mkdir -p /var/spool/sge/context/
chown -R sgeadmin:gridware /var/spool/sge

In case there is already a local spool directory, this can be used too.

$ ls -lhd /var/spool/sge/context/
drwxr-xr-x 2 sgeadmin gridware 20K Mar 21 18:20 /var/spool/sge/context/

====

c) Define a script recording the given job context

Please find the script context.sh attached to this email. You may note that 
adding context information to the job requires the exechost to be a submission 
host as well. In case you don't need the list of used nodes in the email, these 
lines can be deleted and the requirement to be a submission host goes away. 
There is also a common directory /home/common/dungeon defined, accessible only 
to admin staff, where the job scripts are copied to as well. If you don't like 
or don't need this, it can also be removed from the script.

$ ls -lhd /home/common/dungeon/
drwxr-s--- 2 sgeadmin operator 2.3M 2017-03-21 17:42 /home/common/dungeon/

====

d) Attach this context.sh script to SGE

$ qconf -sconf
…
prolog                       sgeadmin@/usr/sge/cluster/busybox env -u LD_LIBRARY_PATH -u LD_PRELOAD -u IFS /usr/sge/cluster/context.sh

In case you don't have busybox available, you can define it as 
sgeadmin@/usr/sge/cluster/context.sh directly. The use of busybox is a safety 
measure: the environment of the job is available during the execution of 
context.sh and might change its behavior. It is not a prerequisite, though.
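The sanitizing effect of the `env -u` wrapper can be demonstrated with plain coreutils env instead of busybox (a sketch; the variable value is made up):

```shell
# A loader variable inherited from the job environment is stripped
# before the helper script runs:
export LD_LIBRARY_PATH=/some/job/libs
env -u LD_LIBRARY_PATH -u LD_PRELOAD -u IFS \
  sh -c 'echo "LD_LIBRARY_PATH is ${LD_LIBRARY_PATH:-unset}"'
# -> LD_LIBRARY_PATH is unset
```

Unsetting LD_PRELOAD and LD_LIBRARY_PATH keeps a job from injecting its own libraries into the prolog; unsetting IFS guards the shell's word splitting.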

====

e) Remove outdated job scripts after two months

On the qmaster machine in /etc/cron.d/gridengine:

5 2 * * 0       sgeadmin        find /home/common/dungeon/ -mtime +60 -delete

It's also possible to define this for the sgeadmin user directly with 
`crontab -e`; note that in this case the column specifying the user must be 
left out.
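The find expression can be rehearsed safely in a scratch directory before putting it into cron (GNU find and GNU touch assumed):

```shell
# Create one backdated and one fresh file, then let the cleanup
# expression delete only files older than 60 days:
tmp=$(mktemp -d)
touch "$tmp/fresh.sh"
touch -d '90 days ago' "$tmp/old.sh"   # backdate the mtime
find "$tmp" -type f -mtime +60 -delete
ls "$tmp"                              # only fresh.sh is left
```

Adding `-type f` here is a small safety refinement over the bare cron line: it keeps find from ever matching (and trying to delete) the directory itself once it ages past the threshold.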

====

f) Use a custom mailer to send the collected information:

See the attached mailer.sh. The created context file is also deleted therein.

====

I hope I didn't forget anything.

Once in a while /var/spool/sge/context on the nodes needs to get a 
spring-cleaning.


-- Reuti



_______________________________________________
SGE-discuss mailing list
SGE-discuss@liv.ac.uk
https://arc.liv.ac.uk/mailman/listinfo/sge-discuss
