Hello,

Thank you for the suggestion, Reuti. I am not sure my users' pipelines can
deal with multiple job IDs, but perhaps they will be willing to modify their
code.
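
If we do go the qresub route, something along these lines is roughly what I
have in mind (an untested sketch only; the "my_pipeline_step" job name is made
up, and the qacct-by-name idea is yours, quoted further down):

  # Submit under a fixed name so the pipeline never has to track the extra
  # numeric job IDs that each qresub creates.
  JOB_NAME=my_pipeline_step
  qsub -N "$JOB_NAME" reshed_test.sh

  # Afterwards, pull the accounting for the original run plus every
  # qresub'ed copy in one query, keyed on the name instead of the number.
  qacct -j "$JOB_NAME"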

On Mon, Jun 11, 2018 at 9:23 AM, Reuti <re...@staff.uni-marburg.de> wrote:

> Hi,
>
>
> I wouldn't be surprised if the execd remembers that the job was already
> warned, hence it must be the hard limit now. Would your workflow allow:
>
This is happening on different nodes, so each execd cannot know any history
by itself; the master must be providing this information. I can't help
wondering whether this is a configurable option.
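
In the meantime I suppose I can at least double-check the queue and cluster
configuration, although I do not know yet whether such an option even exists.
A rough list of what I intend to look at (plain qconf calls, nothing exotic):

  # Queue-level limits and signal handling for the queue the test ran in
  qconf -sq short.q | egrep 's_rt|h_rt|notify|terminate_method'

  # Global and host-level cluster configuration, in case the knob lives there
  qconf -sconf
  qconf -sconf node140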

Ilya.




> . /usr/sge/default/common/settings.sh
> trap "qresub $JOB_ID; exit 4;" SIGUSR1
>
> Well, you get several job numbers this way. For the accounting with
> `qacct` you could use the job name instead of the job number to get all the
> runs listed though.
>
> -- Reuti
>
>
> > This is my test script:
> >
> > #!/bin/bash
> >
> > #$ -S /bin/bash
> > #$ -l s_rt=0:0:5,h_rt=0:0:10
> > #$ -j y
> >
> > set -x
> > set -e
> > set -o pipefail
> > set -u
> >
> > trap "exit 99" SIGUSR1
> >
> > trap "exit 2" SIGTERM
> >
> > echo "hello world"
> >
> > sleep 15
> >
> > It should reschedule itself indefinitely when s_rt lapses. Yet rescheduling
> > happens only once: on the second run the job receives only SIGTERM and
> > exits. Here is the script's output:
> >
> > node140
> > + set -e
> > + set -o pipefail
> > + set -u
> > + trap 'exit 99' SIGUSR1
> > + trap 'exit 2' SIGTERM
> > + echo 'hello world'
> > hello world
> > + sleep 15
> > User defined signal 1
> > ++ exit 99
> > node069
> > + set -e
> > + set -o pipefail
> > + set -u
> > + trap 'exit 99' SIGUSR1
> > + trap 'exit 2' SIGTERM
> > + echo 'hello world'
> > hello world
> > + sleep 15
> > Terminated
> > ++ exit 2
> >
> > The execd logs confirm that the second time around the job was killed for
> > exceeding h_rt:
> >
> > 06/08/2018 21:20:15|  main|node140|W|job 8030395.1 exceeded soft wallclock time - initiate soft notify method
> > 06/08/2018 21:20:59|  main|node140|E|shepherd of job 8030395.1 exited with exit status = 25
> >
> > 06/08/2018 21:21:45|  main|node069|W|job 8030395.1 exceeded hard wallclock time - initiate terminate method
> >
> > And here is the accounting information:
> >
> > ==============================================================
> > qname        short.q
> > hostname     node140
> > group        everyone
> > owner        ilya
> > project      project.p
> > department   defaultdepartment
> > jobname      reshed_test.sh
> > jobnumber    8030395
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Fri Jun  8 21:19:40 2018
> > start_time   Fri Jun  8 21:20:09 2018
> > end_time     Fri Jun  8 21:20:15 2018
> > granted_pe   NONE
> > slots        1
> > failed       25  : rescheduling
> > exit_status  99
> > ru_wallclock 6
> > ...
> > ==============================================================
> > qname        short.q
> > hostname     node069
> > group        everyone
> > owner        ilya
> > project      project.p
> > department   defaultdepartment
> > jobname      reshed_test.sh
> > jobnumber    8030395
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Fri Jun  8 21:19:40 2018
> > start_time   Fri Jun  8 21:21:39 2018
> > end_time     Fri Jun  8 21:21:50 2018
> > granted_pe   NONE
> > slots        1
> > failed       0
> > exit_status  2
> > ru_wallclock 11
> > ...
> >
> >
> > Is there anything in the configuration I could be missing? I am running 6.2u5.
> >
> > Thank you,
> > Ilya.
> >
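
P.S. For completeness, this is the combination I plan to try next: my test
script with Reuti's trap merged in (untested, same s_rt/h_rt limits as above;
each qresub'ed copy will of course carry a new job ID):

  #!/bin/bash
  #$ -S /bin/bash
  #$ -l s_rt=0:0:5,h_rt=0:0:10
  #$ -j y

  . /usr/sge/default/common/settings.sh

  # On the s_rt warning (SIGUSR1), submit a fresh copy of this job and
  # leave with a distinct exit code (4, as in Reuti's example).
  trap "qresub $JOB_ID; exit 4;" SIGUSR1
  trap "exit 2" SIGTERM

  echo "hello world"
  sleep 15
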
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
