Another thing you could do, if you have access to the accounting file or db from the nodes, is to call qacct -j <pipeline_job_id> from the completion_job and capture the 'failed' and 'exit_status' fields. This way you can tell whether a job failed or succeeded even if the job crashed and didn't produce any error output.
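As a rough sketch of that idea, assuming qacct is usable from the execution host and that its output keeps the usual "field value" layout, the completion_job could pull the two fields out like this. The function name qacct_status and the PIPELINE_JOB_ID variable are made up for illustration; SGE does not set the latter, so you would have to pass it in yourself (for example with qsub -v).

```shell
#!/bin/bash
# Hypothetical helper: reads `qacct -j <job_id>` output on stdin and prints
# one "failed=<code> exit_status=<code>" line per accounting record.
# It relies on 'failed' appearing before 'exit_status' in each record.
qacct_status() {
    awk '
        $1 == "failed"      { failed = $2 }
        $1 == "exit_status" { print "failed=" failed " exit_status=" $2 }
    '
}

# Usage inside completion_job. PIPELINE_JOB_ID is an assumed variable that
# you pass in yourself, e.g. qsub -v PIPELINE_JOB_ID=<id> ... completion_job
if [ -n "${PIPELINE_JOB_ID:-}" ]; then
    qacct -j "${PIPELINE_JOB_ID}" | qacct_status
fi
```

An array job produces one accounting record per task, so you would get one line per task.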
I don't know of a way to do this with the job id of the pipeline process. This value doesn't appear to be passed to the completion_job via an environment variable (I'm happy to be wrong; it would solve issues for me), but it is visible from the names of the output and error files, which have been redirected to a directory. You could do something like

    # Collect the unique job ids from the error file names (<jobid>[.<taskid>]).
    # Note $'\n': a plain '\n' would set IFS to a backslash and the letter n.
    IFS=$'\n' read -d '' -r -a jobid_arr < <(ls "${HOME}/err" | cut -d'.' -f1 | sort -u)

    for jobid in "${jobid_arr[@]}"
    do
        qacct -j "${jobid}" | awk '
            BEGIN {
                jobnumber=0
                taskid=0
                failed=0
                exit_status=0
            }
            $1 ~ /^jobnumber$/ { jobnumber=$2 }
            $1 ~ /^taskid$/ && $2 !~ /^undefined$/ { taskid=$2 }
            $1 ~ /^failed$/ { failed=$2 }
            # exit_status follows failed in each record, so report here.
            $1 ~ /^exit_status$/ {
                exit_status=$2
                # Report SGE-level failures as well as non-zero exit codes.
                if (failed != 0 || exit_status != 0) {
                    print "Job Number: " jobnumber
                    if (taskid != 0) {
                        print "Task ID: " taskid
                    }
                    print "Failed: " failed
                    print "Exit Status: " exit_status
                }
                # Reset for the next record (array jobs have one per task).
                jobnumber=0
                taskid=0
                failed=0
                exit_status=0
            }'
    done

And that will print out all your jobs that failed.

John.

On Wed, Jul 16, 2014 at 9:58 AM, John Kloss <john.kl...@gmail.com> wrote:
> I missed your second email to Txema, but I think a simple, if less
> than elegant, method is to redirect stdout and stderr of the pipeline
> jobs to a directory. Each output file and error file can be
> identified by its job id or task id. Then, in your clean up code you
> can sweep through each directory and report which jobs succeeded and
> which jobs failed based on the size and existence of each file.
>
> So you could submit as
>
>     # \$JOB_ID is escaped so that SGE, not the submitting shell, expands it
>     qsub -m e -M <user-email> \
>         -hold_jid $(qsub -terse -e "${HOME}/err/\$JOB_ID" -o "${HOME}/out/\$JOB_ID" pipeline_job) \
>         completion_job
>
> And then in the completion_job you can do something like
>
>     # Non-empty error files mark the jobs that wrote to stderr
>     IFS=$'\n' read -d '' -r -a jobid_arr < <(find "${HOME}/err/" -type f ! -size 0)
>     echo "Following jobs failed: ${jobid_arr[@]}"
>
> And mail that.
>
> The directory approach means you don't have to worry about locking
> like you would if each job were writing to a file. A socket won't
> work across multiple hosts.
> You'd have to use nc or something more complex.
>
> John.
>
>
> On Wed, Jul 16, 2014 at 7:17 AM, John Kloss <john.kl...@gmail.com> wrote:
>> You can submit a job that has a hold placed on it based on your
>> pipeline, whose only purpose is to email you when your pipeline
>> finishes.
>>
>>     qsub -m e -M <user-email> -hold_jid $(qsub -terse pipeline_job) completion_job
>>
>> or, with an array job, which submits job ids as
>> <jobid>.1-<array_size>:?, you can hold on the whole array job
>>
>>     qsub -m e -M <user-email> \
>>         -hold_jid $(qsub -terse -t 1-<pipeline #> pipeline_job | cut -d'.' -f1) \
>>         completion_job
>>
>> I often do clean up and check for error files in the completion_job.
>>
>> John.
>>
>> On Wed, Jul 16, 2014 at 6:05 AM, Paolo Di Tommaso
>> <paolo.ditomm...@gmail.com> wrote:
>>> Hi Txema,
>>>
>>> My point is not to disable them but how to get the notification by using a
>>> different "transport" other than email. I would like that information sent
>>> to a file or a socket.
>>>
>>> p
>>>
>>>
>>> On Wed, Jul 16, 2014 at 11:55 AM, Txema Heredia <txema.llis...@gmail.com>
>>> wrote:
>>>>
>>>> Hi Paolo,
>>>>
>>>> you can disable mails on all but the very last job of the pipeline by
>>>> using -m b|e|a|s|n.
>>>>
>>>> There have been discussions on the list about mechanisms to send all emails
>>>> to the local linux user (not an email address) and send "packages of mails"
>>>> every day or so, but I can't remember any definitive way to work this out.
>>>>
>>>> Txema
>>>>
>>>> On 16/07/14 11:30, Paolo Di Tommaso wrote:
>>>>
>>>> Hi,
>>>>
>>>> SGE can send job notifications via email by using the -M command line
>>>> option.
>>>>
>>>> This is useful when you are submitting a few jobs, but not for complex
>>>> pipelines crunching thousands of jobs.
>>>>
>>>> I'm wondering if SGE can send these notifications by using another
>>>> mechanism, e.g. writing to a file, socket, http, etc.
>>>>
>>>> Thanks,
>>>> Paolo
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users@gridengine.org
>>>> https://gridengine.org/mailman/listinfo/users