On 14.09.2016 at 22:52, Lee, Wayne wrote:

> Hi William,
>
> Thanks for the prompt reply. Apologies for not including more detail with
> regards to my query concerning getting Grid Engine to force all jobs with an
> exit status other than 0, 99 or 100 to error state (i.e. exit code of 100).
>
> As I stated in my earlier post, our jobs execute an epilog script which is
> named "gp_epilog" at the conclusion of the job running on a given execution
> host. The "gp_epilog" essentially does the following:
>
> 1. Obtains the "exit_status" value from the execution host's job spool
> directory from a file named "usage". As an example, take a look at the
> directory listing below from a test job on an execution host with name
> "g00801" where the execution host's spool directory is /tmp/ge/. You then
> will see the "usage" file. The contents of the "usage" file are shown below
> the directory contents. The "exit_status" in the example below is 137.
>
> Directory listing of /tmp/ge/g00801/active_jobs/1012.1
> ============================================================
> /tmp/ugev841/g00801/active_jobs/1012.1:
> total 48
> drwxr-xr-x 2 sgeadmin adm    4096 Sep 13 13:12 .
> drwxr-xr-x 3 sgeadmin adm    4096 Sep 13 13:12 ..
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 addgrpid
> -rw-r--r-- 1 sgeadmin adm    2236 Sep 13 13:12 config
> -rw-r--r-- 1 sgeadmin adm    1546 Sep 13 13:12 environment
> -rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 error
> -rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 exit_status
> prw-r--r-- 1 sgeadmin adm       0 Sep 13 13:12 fifo_execd_to_shepherd
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 job_pid
> -rw-r--r-- 1 sgeadmin adm      54 Sep 13 13:12 pe_hostfile
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 pid
> -rw-r--r-- 1 tdhf781  hougeo 9095 Sep 13 13:12 trace
> -rw-r--r-- 1 sgeadmin adm     324 Sep 13 13:12 usage
>
>
> Contents of Usage file output !!!
> =================================
> wait_status=2193
> exit_status=137
> signal=0
> start_time=1473790362804
> end_time=1473790367828
> ru_wallclock=5.024000
> ru_utime=0.004999
> ru_stime=0.001999
> ru_maxrss=1828
> ru_ixrss=0
> ru_idrss=0
> ru_isrss=0
> ru_minflt=3460
> ru_majflt=0
> ru_nswap=0
> ru_inblock=8
> ru_oublock=96
> ru_msgsnd=0
> ru_msgrcv=0
> ru_nsignals=0
> ru_nvcsw=73
> ru_nivcsw=11
>
>
> 2. Once the value of the "exit_status" is parsed from the "usage" file, the
> "gp_epilog" script just checks whether the value of "exit_status" is
> something other than 0, 99 or 100. If it is, the "gp_epilog" script executes
> an "exit 100". I'm assuming the "exit_status" value in the "usage" file
> comes from the application run by the job/job tasks that executed on the
> execution host g00801 in the example I've listed above. I was thinking that
> if I issue an "exit 100" from within the "gp_epilog" script I've got, the
> job/job task would show up in "error state". I would see this show up in a
> "qstat" output with the job/job task showing a state of "Eqw" or something
> similar.
>
> I've performed some tests by submitting a basic shell script which dumps the
> environment (i.e. env) and performs either an "exit 0", "exit 99", "exit
> 100", "exit 137" or another exit status code. If I set my script to "exit 0",
> the job exits normally. If I set my script to "exit 99", then the job gets
> requeued for execution, and if I set my script to "exit 100", the job goes
> into error state. All of these scenarios are what I expect based on the man
> pages for "queue_conf". However, I am unable to use any other "exit ##",
> trap it and force the job to error state by the method I describe.
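For reference, the check described in steps 1 and 2 of the quoted message could look roughly like the sketch below. It is not the actual "gp_epilog" script, which was not posted. The function names `read_exit_status` and `epilog_exit_code` are hypothetical, the path to the "usage" file is passed in explicitly, and returning 0 for the statuses 0, 99 and 100 is an assumption, since the message only says what happens in the other case.

```shell
#!/bin/sh
# Sketch of the epilog check described in the quoted message.
# Function names are hypothetical; this is not the real gp_epilog.

# Print the value of the exit_status line from a "usage" file.
# $1: path to the usage file (key=value lines, as shown above)
read_exit_status() {
    sed -n 's/^exit_status=//p' "$1" | head -n 1
}

# Print the exit code the epilog should finish with.
# 0, 99 and 100 map to 0 (an assumption: leave the job outcome alone);
# any other job exit status maps to 100.
# $1: exit status of the job
epilog_exit_code() {
    case "$1" in
        0|99|100) echo 0 ;;
        *)        echo 100 ;;
    esac
}

# Possible use at the end of an epilog (path taken from the example above):
#   status=$(read_exit_status /tmp/ge/g00801/active_jobs/1012.1/usage)
#   exit "$(epilog_exit_code "$status")"
```

With the "usage" file shown above, `read_exit_status` would yield 137 and the epilog would finish with 100.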
This should work. In the `qacct` output you can even see a mixture of the real
exit code and the job being rescheduled:

$ qacct -j 1083478
==============================================================
qname        parallel
hostname     node17
...
slots        4
failed       30 : rescheduling on application error
exit_status  56
...

While the job exiting with 100 would of course show:

failed       30 : rescheduling on application error
exit_status  100

-- Reuti

> I'm not sure if what I'm trying to do makes sense or whether I should
> consider a different way to do what I am attempting. I can look at the
> "starter_method" to see if this is a viable way.
>
> Thanks in advance.
>
> -----
> Wayne Lee
>
>
> -----Original Message-----
> From: William Hay [mailto:w....@ucl.ac.uk]
> Sent: Wednesday, September 14, 2016 2:38 AM
> To: Lee, Wayne <w...@hess.com>
> Cc: users@gridengine.org Group <users@gridengine.org>
> Subject: Re: [gridengine users] Forcing Grid Engine jobs to error state with
> exit status other than 0, 99 or 100.
>
> On Tue, Sep 13, 2016 at 06:52:53PM +0000, Lee, Wayne wrote:
>> In the epilog script that I've set up for our jobs, I've attempted to
>> capture the value of the "exit_status" of a job or job task and if it
>> isn't 0, 99 or 100, exit the epilog script with an "exit 100". However
>> this doesn't appear to work.
>
> In general when describing an issue or problem it is more helpful to describe
> what does happen than what doesn't. The number of things that didn't happen
> when you made the epilog script exit 100 is almost infinite.
>
>>
>> Another way of stating what I'm trying to convey is: if the exit status of
>> a job or job task is anything other than 0, 99 or 100, put the job in error
>> state. If this can be done, then we would know that a job didn't
>> complete correctly and if it is in Eqw state we have the option of
>> clearing error state (i.e. qmod -cj) and re-executing the job again.
>
> One possibility would be to write a starter_method that wraps the real job
> and does an exit 100 when the job terminates with an exit status other than 0
> or 99.
>
> William
>
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
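
A minimal sketch of the starter_method wrapper William suggests might look like the following. It assumes the starter_method is invoked with the job script and its arguments as its own arguments. The function name `run_and_map` is hypothetical, and this has not been tested against a live Grid Engine installation.

```shell
#!/bin/sh
# Sketch of a starter_method wrapper: run the real job and turn every exit
# status other than 0 or 99 into 100. The name run_and_map is hypothetical.

# Run the command given as arguments and return the mapped exit status.
run_and_map() {
    "$@"
    status=$?
    case "$status" in
        0|99) return "$status" ;;   # success and "requeue" pass through
        *)    return 100 ;;         # everything else becomes 100
    esac
}

# As a starter_method, the script would then end with (assumed invocation):
#   run_and_map "$@"
#   exit $?
```

Note that a job killed by a signal reaches the wrapper as a status above 128 (such as the 137 in the example), so it is mapped to 100 as well.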