I changed the subject, as this is a setup which might be advantageous for other 
users too.


The constraint one faces when changing or adding information in the email sent 
from the exechost to the user is that the email is sent after the job has 
already left the exechost and GridEngine. So there is nothing left which could 
be scanned for relevant information besides the messages file of the exechost. 
If this information is sufficient (and no access to the job's context is 
required), one can skip to item f) below. All paths need to be adjusted to 
reflect the structure of your cluster (also in the two supplied scripts).

====

a) Record the issued command in a job context:

To implement this feature we need two definitions in the Bash profile:

# Recover the typed command line from the shell history, record it in the
# job context as COMMAND and hand the original arguments on to the real qsub.
qsub-helper() { (export AC="`history 1 | perl -pe 's/^ *[0-9]+ +[^ ]+ ?//'`";
eval /usr/sge/bin/lx24-em64t/qsub -ac "COMMAND='qsub $AC'" "$AC") }
# The trailing '#' comments out the typed arguments, so only the helper runs.
alias qsub='qsub-helper #'

(I got this hint on the Bash mailing list pointing to 
http://www.chiark.greenend.org.uk/~sgtatham/aliases.html)

In case you wonder about lx24-em64t while you might need lx24-amd64 here: in 
the beginning we had a mixture of Intel and AMD machines, and to target the 
correct binary I adjusted the `arch` script to behave accordingly and used 
$ARC in the job scripts, while the directory trees below these names had the 
same layout for both types of CPU.
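
Once the profile is sourced again, the recorded command line shows up in the 
job's context (the job id and the exact qstat layout here are only 
illustrative):

$ qsub -pe smp 4 job.sh
Your job 12345 ("job.sh") has been submitted
$ qstat -j 12345 | grep context
context:                    COMMAND=qsub -pe smp 4 job.sh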

====

b) Prepare to store the context of a job after it has finished:

On all exechosts we need a directory which is writable by the sgeadmin user 
(it could be in a shared directory too, I prefer it on each node):

mkdir -p /var/spool/sge/context/
chown -R sgeadmin:gridware /var/spool/sge

In case there is already a local spool directory, this can be used too.

$ ls -lhd /var/spool/sge/context/
drwxr-xr-x 2 sgeadmin gridware 20K Mar 21 18:20 /var/spool/sge/context/
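
With many nodes a loop over the exechost list saves some typing (just a 
sketch, assuming passwordless ssh access as root to all nodes):

for h in `qconf -sel`; do
    ssh root@$h 'mkdir -p /var/spool/sge/context/ && chown -R sgeadmin:gridware /var/spool/sge'
done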

====

c) Define a script recording the given job context

Please find this script context.sh attached to this email. Note that for 
adding context information to the job, the exechost must also be a submission 
host. In case you don't need the list of used nodes in the email, these lines 
can be deleted and the requirement to be a submission host disappears. The 
script also defines a common directory /home/common/dungeon, accessible only 
for administrative purposes, to which the job scripts are copied. If you 
don't like or don't need this, it can also be removed from the script.

$ ls -lhd /home/common/dungeon/
drwxr-s--- 2 sgeadmin operator 2.3M 2017-03-21 17:42 /home/common/dungeon/
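
The attached script is the reference; as the attachment may not survive all 
archives, below is only a minimal sketch of the idea. The qstat call and the 
environment variables (JOB_ID, JOB_SCRIPT) it relies on are assumptions and 
may differ from the real context.sh:

#!/bin/sh
# Sketch of a prolog recording the job context -- see the attached context.sh
# for the real thing. Assumes JOB_ID and JOB_SCRIPT are set in the prolog
# environment.
CONTEXT_DIR=/var/spool/sge/context
DUNGEON=/home/common/dungeon

# Dump the job information (including the context with COMMAND=...) so the
# mailer can read it after the job has left GridEngine. Querying qmaster
# like this is why the exechost must also be a submission host; the real
# script also collects the list of granted nodes here.
/usr/sge/bin/lx24-em64t/qstat -j "$JOB_ID" > "$CONTEXT_DIR/$JOB_ID" 2>&1

# Keep a copy of the submitted job script for the admins.
cp "$JOB_SCRIPT" "$DUNGEON/$JOB_ID" 2>/dev/null

# A failing prolog puts the queue into an error state, hence always exit 0.
exit 0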

====

d) Attach this context.sh script to SGE

$ qconf -sconf
…
prolog                       sgeadmin@/usr/sge/cluster/busybox env -u LD_LIBRARY_PATH -u LD_PRELOAD -u IFS /usr/sge/cluster/context.sh

In case you don't have busybox available, you can define it just as 
sgeadmin@/usr/sge/cluster/context.sh. The use of busybox is a safety measure, 
not a prerequisite: the environment of the job is available during the 
execution of context.sh and might change its behavior.
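
The entry can be set by editing the global configuration (the prolog line 
shown above is then added or adjusted in the editor which opens):

$ qconf -mconf global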

====

e) Remove outdated job scripts after 2 months

On the qmaster machine in /etc/cron.d/gridengine:

5 2 * * 0       sgeadmin        find /home/common/dungeon/ -mtime +60 -delete

It's also possible to define it for the sgeadmin user directly with 
`crontab -e`; note that in this case the column specifying the user must be 
left out:
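
5 2 * * 0       find /home/common/dungeon/ -mtime +60 -delete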

====

f) Use a custom mailer to send the collected information:

See the attached mailer.sh; it also deletes the created context file once the 
mail has been handed over.
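
As a rough illustration only (the attached script is the reference), a custom 
mailer could look like the sketch below. It assumes SGE invokes the mailer 
mailx-style, i.e. "mailer -s <subject> <recipient>" with the body on stdin, 
and that the job id can be extracted from the body:

#!/bin/sh
# Sketch of a custom mailer -- see the attached mailer.sh for the real thing.
CONTEXT_DIR=/var/spool/sge/context

BODY="`cat`"

# Guess the job id from the mail body to locate the stored context file.
JOB_ID="`echo "$BODY" | sed -n 's/.*[Jj]ob \([0-9][0-9]*\).*/\1/p' | head -1`"

{
    echo "$BODY"
    if [ -n "$JOB_ID" ] && [ -f "$CONTEXT_DIR/$JOB_ID" ]; then
        echo ""
        cat "$CONTEXT_DIR/$JOB_ID"
    fi
} | /bin/mail "$@"    # hand everything over to the system mailer

# Remove the context file once the mail went out.
[ -n "$JOB_ID" ] && rm -f "$CONTEXT_DIR/$JOB_ID"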

====

I hope I didn't forget anything.

Once in a while /var/spool/sge/context on the nodes needs to get a 
spring-cleaning.


-- Reuti



