Another thing you could do, if you have access to the accounting file
or db from the nodes, is to call qacct -j <pipeline_job_id> from the
completion_job and capture the 'failed' and 'exit_status' fields.
This way you can tell whether a job failed or succeeded even if it
crashed and didn't produce any error output.
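
For a single job that's just a couple of lines, e.g. (a sketch;
pipeline_job_id here is whatever you captured from qsub -terse at
submit time):

qacct -j "${pipeline_job_id}" | \
  awk '$1 == "failed" || $1 == "exit_status" { print $1 ": " $2 }'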

I don't know of a way to do this with the job id of the pipeline
process.  That value doesn't appear to be passed to the completion_job
via an environment variable (I'm happy to be wrong; it would solve
issues for me), but it is visible from the output and error files
which have been redirected to a directory.  You could do something
like

IFS=$'\n' read -d '' -r -a jobid_arr < <(ls "${HOME}/err" | cut -d'.' -f1 \
  | sort -u)

for jobid in "${jobid_arr[@]}"
do
  qacct -j "${jobid}" | awk '
    BEGIN {
      jobnumber=0
      taskid=0
      failed=0
      exit_status=0
    }
    # pick up the fields we care about from each qacct record
    $1 ~ /^jobnumber$/ { jobnumber=$2 }
    $1 ~ /^taskid$/ && $2 !~ /^undefined$/ { taskid=$2 }
    $1 ~ /^failed$/ { failed=$2 }
    # exit_status is the last of these fields in a record, so report
    # and reset here; a job is bad if either field is non-zero
    $1 ~ /^exit_status$/ { exit_status=$2
      if (failed != 0 || exit_status != 0) {
        print "Job Number: " jobnumber
        if (taskid != 0) { print "Task ID: " taskid }
        print "Failed: " failed
        print "Exit Status: " exit_status
      }
      jobnumber=0
      taskid=0
      failed=0
      exit_status=0
    }'
done

And that will print out all your jobs that failed.
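
As an aside, if you can change the submit line you can sidestep the
directory listing entirely by handing the pipeline job id to the
completion_job yourself with -v.  A sketch (PIPELINE_JID is just a
name I made up):

PIPELINE_JID=$(qsub -terse -t 1-<pipeline #> pipeline_job | cut -d'.' -f1)
qsub -m e -M <user-email> -v PIPELINE_JID="${PIPELINE_JID}" \
  -hold_jid "${PIPELINE_JID}" completion_job

completion_job can then call qacct -j "${PIPELINE_JID}" directly and
skip the ls/cut step above.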

  John.

On Wed, Jul 16, 2014 at 9:58 AM, John Kloss <john.kl...@gmail.com> wrote:
> I missed your second email to Txema, but I think a simple, if less
> than elegant, method is to redirect stdout and stderr of the pipeline
> jobs to a directory.  Each output file and error file can be
> identified by its job id or task id.  Then, in your clean-up code you
> can sweep through each directory and report which jobs succeeded and
> which failed based on the size and existence of each file.
>
> So you could submit as
>
> qsub -m e -M <user-email> -hold_jid $(qsub -terse -e
> ${HOME}/err/\$JOB_ID -o ${HOME}/out/\$JOB_ID pipeline_job)
> completion_job
>
> (note the escaped \$JOB_ID: the shell has to pass it through
> literally so that SGE, not the shell, expands it when creating the
> files)
>
> And then in the completion_job you can do something like
>
> IFS=$'\n' read -d '' -r -a jobid_arr < <(find "${HOME}/err/" -type f ! -size 0)
> echo "Following jobs failed: ${jobid_arr[@]}"
>
> And mail that.
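>
> The mail step itself can be as simple as (assuming a mail/mailx
> binary is available on the execution host):
>
> printf '%s\n' "Following jobs failed:" "${jobid_arr[@]}" | \
>   mail -s "pipeline finished" <user-email>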
>
> The directory approach means you don't have to worry about locking
> like you would if each job were writing to a file.  A socket won't
> work across multiple hosts.  You'd have to use nc or something more
> complex.
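>
> If you did want to try nc, the rough shape would be one listener
> collecting status lines from every job, e.g. (untested, nc flags
> vary between netcat flavors, and the port number is arbitrary):
>
> # on the collecting host
> nc -lk 5555 >> "${HOME}/job_status.log" &
>
> # at the end of each pipeline job script, right after the payload
> # command so $? is still its exit status
> echo "${JOB_ID}.${SGE_TASK_ID} exit=$?" | nc <submit-host> 5555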
>
>   John.
>
>
> On Wed, Jul 16, 2014 at 7:17 AM, John Kloss <john.kl...@gmail.com> wrote:
>> You can submit a job that has a hold placed on it based on your
>> pipeline, whose only purpose is to email you when your pipeline
>> finishes.
>>
>> qsub -m e -M <user-email> -hold_jid $(qsub -terse pipeline_job) 
>> completion_job
>>
>> or, with an array job, whose job id qsub -terse reports as
>> <jobid>.1-<array_size>:<step>, you can hold on the whole array job
>>
>> qsub -m e -M <user-email> -hold_jid $(qsub -terse -t 1-<pipeline #>
>> pipeline_job | cut -d'.' -f1) completion_job
>>
>> I often do clean up and check for error files in the completion_job.
>>
>>   John.
>>
>> On Wed, Jul 16, 2014 at 6:05 AM, Paolo Di Tommaso
>> <paolo.ditomm...@gmail.com> wrote:
>>> Hi Txema,
>>>
>>> My point is not to disable them but how to get the notifications through a
>>> different "transport" than email. I would like that information to go to a
>>> file or a socket.
>>>
>>> p
>>>
>>>
>>> On Wed, Jul 16, 2014 at 11:55 AM, Txema Heredia <txema.llis...@gmail.com>
>>> wrote:
>>>>
>>>> Hi Paolo,
>>>>
>>>> you can disable mail on all but the very last job of the pipeline by
>>>> using the -m option (b|e|a|s to pick begin/end/abort/suspend
>>>> notifications, or n for no mail at all).
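>>>>
>>>> For example:
>>>>
>>>> qsub -m n pipeline_job               # no mail for these
>>>> qsub -m e -M <user-email> last_job   # mail when the last job ends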
>>>>
>>>> There have been discussions on the list about mechanisms to send all
>>>> mail to the local linux user (not an email address) and to send
>>>> "packages of mails" every day or so, but I can't remember any
>>>> definitive way to work this out.
>>>>
>>>> Txema
>>>>
>>>> El 16/07/14 11:30, Paolo Di Tommaso escribió:
>>>>
>>>> Hi,
>>>>
>>>> SGE can send job notifications via email by using the -M command line
>>>> option.
>>>>
>>>> This is useful when you are submitting a few jobs, but not for a complex
>>>> pipeline crunching thousands of jobs.
>>>>
>>>> I'm wondering if SGE can send these notifications by using other
>>>> mechanisms, e.g. writing to a file, a socket, HTTP, etc.
>>>>
>>>>
>>>> Thanks,
>>>> Paolo
>>>>
