Nice. Unfortunately it does not work. I was wondering why my jobs always show "exit code 0" even when they obviously failed due to lack of memory. I thought it was a bug in GE... but it is not.
What is probably happening here is:
- We submit a job with memory constraints
- The job runs in a constrained "setrlimit()" environment
- The application needs some memory, so it calls for example malloc(1024*1024*1024)
- malloc() fails with ENOMEM
- The application handles it gracefully, perhaps complains a little in its log file, and exits (it is NOT killed)
- SGE did not kill the job; it sees that the job ended by itself, so it reports 0

Does it make sense? If yes, is it possible to configure GE not to run the job in a constrained environment (like the one we get with, for example, "qrsh -l h_vmem=1G"), but rather to kill it instantly once it grows too large? That way, I believe, GE would show a more accurate job exit status in case the job was killed due to lack of memory.

Many thanks,
Ondrej

From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Tuesday, March 21, 2017 6:30 PM
To: Ondrej Valousek <ondrej.valou...@s3group.com>
Cc: William Hay <w....@ucl.ac.uk>; sge-discuss@liv.ac.uk <sge-disc...@liverpool.ac.uk>
Subject: Customizing emails sent from GridEngine (was: RE: [SGE-discuss] Enforcing memory limit for job)

I changed the subject, as this is a setup which might be advantageous for other users too.

The constraint one faces when changing or adding information to the email sent from the exechost to the user is that the email is sent after the job has already left the exechost and GridEngine. So there is nothing left that could be scanned for relevant information besides the messages file of the exechost. If this information is sufficient (and no access to the job's context is required), you can skip to item f) below.

All paths need to be adjusted to reflect the structure of your cluster (also in the two supplied scripts).
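The failure mode Ondrej describes can be reproduced outside of SGE with a plain setrlimit()-style limit. This is a minimal sketch under the assumption that python3 is available; it merely stands in for any application that handles an allocation failure gracefully:

```shell
#!/bin/sh
# Impose an address-space limit (as h_vmem does via setrlimit()), then
# run a program that catches its own allocation failure and exits
# normally. The shell -- like SGE -- only sees a clean exit code 0.
(
  ulimit -v 1048576          # limit virtual memory to ~1 GiB (value in KiB)
  python3 - <<'EOF'
try:
    buf = bytearray(2 * 1024**3)   # try to allocate 2 GiB
except MemoryError:
    print("allocation failed; handling it gracefully and exiting")
EOF
)
echo "exit code: $?"         # prints "exit code: 0" despite the failure
```

Nothing here was killed by a signal, so there is no nonzero status for the scheduler to record, which matches the behavior seen in the job accounting.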
====
a) Record the issued command in a job context:

To implement this feature we need two definitions in the Bash profile:

qsub-helper() {
    (export AC="`history 1 | perl -pe 's/^ *[0-9]+ +[^ ]+ ?//'`"; eval /usr/sge/bin/lx24-em64t/qsub -ac "COMMAND='qsub $AC'" "$AC")
}
alias qsub='qsub-helper #'

(I got this hint on the Bash mailing list, pointing to http://www.chiark.greenend.org.uk/~sgtatham/aliases.html)

In case you wonder about the lx24-em64t where you might expect lx24-amd64: in the beginning we had a mixture of Intel and AMD machines, and to target the correct binary I adjusted the `arch` script to behave accordingly and used $ARC in the job scripts, while the directory trees below were congruent for each type of CPU.

====
b) Prepare to store the context of a job after it has finished:

On all exechosts we need a directory which is writable by the sgeadmin user (it could be in a shared directory too; I prefer it on each node):

mkdir -p /var/spool/sge/context/
chown -R sgeadmin:gridware /var/spool/sge

In case there is already a local spool directory, this can be used too.

$ ls -lhd /var/spool/sge/context/
drwxr-xr-x 2 sgeadmin gridware 20K Mar 21 18:20 /var/spool/sge/context/

====
c) Define a script recording the given job context:

Please find the script context.sh attached to this email. Note that for adding context information to the job, the exechost must also be a submission host. In case you don't need the list of used nodes in the email, those lines can be deleted and the exechost need not be a submission host.

There is also a common directory /home/common/dungeon defined there, accessible only for admin staff, to which the job scripts are copied as well. If you don't like or don't need this, it can also be removed from the script.
$ ls -lhd /home/common/dungeon/
drwxr-s--- 2 sgeadmin operator 2.3M 2017-03-21 17:42 /home/common/dungeon/

====
d) Attach this context.sh script to SGE:

$ qconf -sconf
…
prolog sgeadmin@/usr/sge/cluster/busybox env -u LD_LIBRARY_PATH -u LD_PRELOAD -u IFS /usr/sge/cluster/context.sh

In case you don't have busybox available, you can define it as:

sgeadmin@/usr/sge/cluster/context.sh

The use of busybox is a safety measure, as the environment of the job will be available during the execution of context.sh and might change its behavior; it is not a prerequisite.

====
e) Remove outdated job scripts after 2 months:

On the qmaster machine, in /etc/cron.d/gridengine:

5 2 * * 0 sgeadmin find /home/common/dungeon/ -mtime +60 -delete

It is also possible to define this for the sgeadmin user directly with `crontab -e`; note that in this case the column specifying the user must be left out.

====
f) Use a custom mailer to send the collected information:

See the attached mailer.sh. The created context file is also deleted therein.

====

I hope I didn't forget anything. Once in a while, /var/spool/sge/context on the nodes needs a spring-cleaning.

-- Reuti