Hello,

Thank you for the suggestion, Reuti. I'm not sure my users' pipelines can
deal with multiple job IDs; perhaps they will be willing to modify their
code.
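If they can key things off the job name rather than the job number, I
imagine the wrapper would look roughly like this (an untested sketch of
your suggestion; the job name "pipeline_step" is just a placeholder, the
exit code 4 is from your snippet):

#!/bin/bash

#$ -S /bin/bash
#$ -N pipeline_step           # fixed job name shared by every rerun
#$ -l s_rt=0:0:5,h_rt=0:0:10
#$ -j y

# Make qresub available inside the job environment.
. /usr/sge/default/common/settings.sh

# On the soft-limit warning, submit a fresh copy of this job
# (new job number, same job name) and exit.
trap "qresub $JOB_ID; exit 4;" SIGUSR1

sleep 15    # stand-in for the real work

Accounting for all the reruns could then be collected by name with
`qacct -j pipeline_step`, as you describe.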
On Mon, Jun 11, 2018 at 9:23 AM, Reuti <re...@staff.uni-marburg.de> wrote:
> Hi,
>
> I wouldn't be surprised if the execd remembers that the job was already
> warned, hence it must be the hard limit now. Would your workflow allow:

This is happening on different nodes, so each execd cannot know any history
by itself; the master must be providing this information. I can't help
wondering whether this is a configurable option.

Ilya.

> . /usr/sge/default/common/settings.sh
> trap "qresub $JOB_ID; exit 4;" SIGUSR1
>
> Well, you get several job numbers this way. For the accounting with
> `qacct` you could use the job name instead of the job number to get all
> the runs listed, though.
>
> -- Reuti
>
> > This is my test script:
> >
> > #!/bin/bash
> >
> > #$ -S /bin/bash
> > #$ -l s_rt=0:0:5,h_rt=0:0:10
> > #$ -j y
> >
> > set -x
> > set -e
> > set -o pipefail
> > set -u
> >
> > trap "exit 99" SIGUSR1
> > trap "exit 2" SIGTERM
> >
> > echo "hello world"
> >
> > sleep 15
> >
> > It should reschedule itself indefinitely when s_rt lapses, yet the
> > rescheduling happens only once. On the second run the job receives
> > only SIGTERM and exits. Here is the script's output:
> >
> > node140
> > + set -e
> > + set -o pipefail
> > + set -u
> > + trap 'exit 99' SIGUSR1
> > + trap 'exit 2' SIGTERM
> > + echo 'hello world'
> > hello world
> > + sleep 15
> > User defined signal 1
> > ++ exit 99
> >
> > node069
> > + set -e
> > + set -o pipefail
> > + set -u
> > + trap 'exit 99' SIGUSR1
> > + trap 'exit 2' SIGTERM
> > + echo 'hello world'
> > hello world
> > + sleep 15
> > Terminated
> > ++ exit 2
> >
> > The execd logs confirm that the second time around the job was killed
> > for exceeding h_rt:
> >
> > 06/08/2018 21:20:15| main|node140|W|job 8030395.1 exceeded soft
> > wallclock time - initiate soft notify method
> > 06/08/2018 21:20:59| main|node140|E|shepherd of job 8030395.1 exited
> > with exit status = 25
> >
> > 06/08/2018 21:21:45| main|node069|W|job 8030395.1 exceeded hard
> > wallclock time - initiate terminate method
> >
> > And here is the accounting information:
> >
> > ==============================================================
> > qname        short.q
> > hostname     node140
> > group        everyone
> > owner        ilya
> > project      project.p
> > department   defaultdepartment
> > jobname      reshed_test.sh
> > jobnumber    8030395
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Fri Jun 8 21:19:40 2018
> > start_time   Fri Jun 8 21:20:09 2018
> > end_time     Fri Jun 8 21:20:15 2018
> > granted_pe   NONE
> > slots        1
> > failed       25 : rescheduling
> > exit_status  99
> > ru_wallclock 6
> > ...
> > ==============================================================
> > qname        short.q
> > hostname     node069
> > group        everyone
> > owner        ilya
> > project      project.p
> > department   defaultdepartment
> > jobname      reshed_test.sh
> > jobnumber    8030395
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Fri Jun 8 21:19:40 2018
> > start_time   Fri Jun 8 21:21:39 2018
> > end_time     Fri Jun 8 21:21:50 2018
> > granted_pe   NONE
> > slots        1
> > failed       0
> > exit_status  2
> > ru_wallclock 11
> > ...
> >
> > Is there anything in the configuration I could be missing? Running
> > 6.2u5.
> >
> > Thank you,
> > Ilya.
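P.S. To check whether the second run is really treated as a restart of the
first (which would fit your theory that the "already warned" state travels
with the job), I may try logging the RESTARTED variable, which qsub(1)
documents as being set to 1 in the environment of a restarted job. An
untested sketch of the test script with that added:

#!/bin/bash

#$ -S /bin/bash
#$ -l s_rt=0:0:5,h_rt=0:0:10
#$ -j y

# RESTARTED is documented in qsub(1); whether an exit-99 requeue
# counts as a "restart" is exactly what this should reveal.
echo "host=$(hostname) restarted=${RESTARTED:-unset}"

trap "exit 99" SIGUSR1
trap "exit 2" SIGTERM

sleep 15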
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users