I changed the subject, as this is a setup which might be advantageous for other users too.
The constraint one faces when changing or adding information in the email sent from the exechost to the user is that the email is sent only after the job has already left the exechost and GridEngine. So there is nothing left which could be scanned for relevant information besides the messages file of the exechost. If this information is sufficient (and no access to the job's context is required), one can skip to item f) below. All paths need to be adjusted to reflect the structure of your cluster (also in the two supplied scripts).

====

a) Record the issued command in a job context:

To implement this feature we need two definitions in the Bash profile:

qsub-helper() { (export AC="`history 1 | perl -pe 's/^ *[0-9]+ +[^ ]+ ?//'`"; eval /usr/sge/bin/lx24-em64t/qsub -ac "COMMAND='qsub $AC'" "$AC") }
alias qsub='qsub-helper #'

(I got this hint on the Bash mailing list, pointing to http://www.chiark.greenend.org.uk/~sgtatham/aliases.html)

In case you wonder about the lx24-em64t while you might need lx24-amd64 here: in the beginning we had a mixture of Intel and AMD machines, and to target the correct binary I adjusted the `arch` script to behave accordingly and used $ARC in the job scripts, while the directory trees below were congruent for each type of CPU.

====

b) Prepare to store the context of a job after it finished:

On all exechosts we need a directory which is writable by the sgeadmin user (it could be in a shared directory too, I prefer it on each node):

mkdir -p /var/spool/sge/context/
chown -R sgeadmin:gridware /var/spool/sge

In case there is already a local spool directory, this can be used too.

$ ls -lhd /var/spool/sge/context/
drwxr-xr-x 2 sgeadmin gridware 20K Mar 21 18:20 /var/spool/sge/context/

====

c) Define a script recording the given job context:

Please find this script context.sh attached to this email. You may note that for adding context information to the job it's necessary that the exechost is also a submission host. In case you don't need the list of used nodes in the email, these lines can be deleted and the requirement to be a submission host goes away. (A rough sketch of what such a prolog could do is appended at the end of this mail.)

There is also a common directory /home/common/dungeon defined, accessible only to admin staff, to which the job scripts are copied as well. If you don't like or don't need this, it can also be removed from the script.

$ ls -lhd /home/common/dungeon/
drwxr-s--- 2 sgeadmin operator 2.3M 2017-03-21 17:42 /home/common/dungeon/

====

d) Attach this context.sh script to SGE:

$ qconf -sconf
…
prolog sgeadmin@/usr/sge/cluster/busybox env -u LD_LIBRARY_PATH -u LD_PRELOAD -u IFS /usr/sge/cluster/context.sh

In case you don't have busybox available, you can define it as:

sgeadmin@/usr/sge/cluster/context.sh

The use of busybox is a safety measure, as the environment of the job will be available during the execution of context.sh and might change its behavior; it is not a prerequisite.

====

e) Remove outdated job scripts after two months:

On the qmaster machine in /etc/cron.d/gridengine:

5 2 * * 0 sgeadmin find /home/common/dungeon/ -mtime +60 -delete

It's also possible to define it for the sgeadmin user directly with `crontab -e`; note that in this case the column specifying the user must be left out.

====

f) Use a custom mailer to send the collected information:

See the attached mailer.sh. The created context file will also be deleted therein.

====

I hope I didn't forget anything. Once in a while /var/spool/sge/context on the nodes needs a spring-cleaning.

-- Reuti
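P.S.: To illustrate the idea behind c) inline: the following is only a rough sketch of what such a context-recording prolog could do, not the attached context.sh. It assumes the prolog environment provides JOB_ID, PE_HOSTFILE and SGE_JOB_SPOOL_DIR, uses the binary path from a) and the directories from b) and c); all paths (especially the location of the spooled job script) depend on your site and need adjusting.

#!/bin/sh
# Rough illustration of a context-recording prolog -- not the attached context.sh.
# Assumes JOB_ID, PE_HOSTFILE and SGE_JOB_SPOOL_DIR are set in the prolog
# environment; adjust all paths to your cluster.

CONTEXT_DIR=/var/spool/sge/context
DUNGEON=/home/common/dungeon
SGE_BIN=/usr/sge/bin/lx24-em64t

# Add the list of granted nodes to the job's context (this is the part which
# requires the exechost to be a submission host).
if [ -n "$PE_HOSTFILE" ]; then
    NODES=`cut -d" " -f1 "$PE_HOSTFILE" | sort -u | tr "\n" "," | sed "s/,$//"`
    $SGE_BIN/qalter -ac "NODES=$NODES" "$JOB_ID" > /dev/null 2>&1
fi

# Keep a copy of the spooled job script for later inspection; where the spooled
# script lives depends on your execd spool directory layout (hypothetical path).
cp "$SGE_JOB_SPOOL_DIR/../../job_scripts/$JOB_ID" "$DUNGEON/$JOB_ID" 2>/dev/null

# Dump the job's details including its context (COMMAND=..., NODES=...) where
# the custom mailer from f) can pick them up after the job is gone.
$SGE_BIN/qstat -j "$JOB_ID" > "$CONTEXT_DIR/$JOB_ID" 2>/dev/null

exit 0

The mailer from f) then only needs to read the file in /var/spool/sge/context/, merge its content into the outgoing mail, and delete it afterwards.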
_______________________________________________ SGE-discuss mailing list SGE-discuss@liv.ac.uk https://arc.liv.ac.uk/mailman/listinfo/sge-discuss