Re: [gridengine users] Forcing Grid Engine jobs to error state with exit status other than 0, 99 or 100.

Reuti Wed, 14 Sep 2016 14:31:31 -0700

On 14.09.2016 at 22:52, Lee, Wayne wrote:

> Hi William,
> 
> Thanks for the prompt reply. Apologies for not including more detail in my 
> query about getting Grid Engine to force all jobs with an exit status other 
> than 0, 99 or 100 into error state (i.e. exit code of 100).
> 
> As I stated in my earlier post, our jobs run an epilog script named 
> "gp_epilog" when a job finishes on a given execution host. The "gp_epilog" 
> script essentially does the following:
> 
> 1.  Obtains the "exit_status" value from a file named "usage" in the job's 
> directory under the execution host's spool directory. As an example, take a 
> look at the directory listing below from a test job on the execution host 
> "g00801", where the execution host's spool directory is /tmp/ge/. You will 
> see the "usage" file there. The contents of the "usage" file are shown below 
> the directory listing. The "exit_status" in the example below is 137.
> 
> Directory listing of /tmp/ge/g00801/active_jobs/1012.1
> ============================================================
> /tmp/ugev841/g00801/active_jobs/1012.1:
> total 48
> drwxr-xr-x 2 sgeadmin adm    4096 Sep 13 13:12 .
> drwxr-xr-x 3 sgeadmin adm    4096 Sep 13 13:12 ..
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 addgrpid
> -rw-r--r-- 1 sgeadmin adm    2236 Sep 13 13:12 config
> -rw-r--r-- 1 sgeadmin adm    1546 Sep 13 13:12 environment
> -rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 error
> -rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 exit_status
> prw-r--r-- 1 sgeadmin adm       0 Sep 13 13:12 fifo_execd_to_shepherd
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 job_pid
> -rw-r--r-- 1 sgeadmin adm      54 Sep 13 13:12 pe_hostfile
> -rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 pid
> -rw-r--r-- 1 tdhf781  hougeo 9095 Sep 13 13:12 trace
> -rw-r--r-- 1 sgeadmin adm     324 Sep 13 13:12 usage
> 
> 
> Contents of the "usage" file
> ============================
> wait_status=2193
> exit_status=137
> signal=0
> start_time=1473790362804
> end_time=1473790367828
> ru_wallclock=5.024000
> ru_utime=0.004999
> ru_stime=0.001999
> ru_maxrss=1828
> ru_ixrss=0
> ru_idrss=0
> ru_isrss=0
> ru_minflt=3460
> ru_majflt=0
> ru_nswap=0
> ru_inblock=8
> ru_oublock=96
> ru_msgsnd=0
> ru_msgrcv=0
> ru_nsignals=0
> ru_nvcsw=73
> ru_nivcsw=11
> 
> 
> 2. Once the "exit_status" value has been parsed from the "usage" file, the 
> "gp_epilog" script checks whether it equals 0, 99 or 100. If it does not, 
> the "gp_epilog" script executes an "exit 100". I'm assuming the 
> "exit_status" value in the "usage" file comes from the application run by 
> the job/job task on the execution host (g00801 in the example above). I was 
> thinking that if I issue an "exit 100" from within the "gp_epilog" script, 
> the job/job task would be put into error state, and I would see it in the 
> "qstat" output with a state of "Eqw" or something similar.
> 
> I've performed some tests by submitting a basic shell script which dumps the 
> environment (i.e. env) and then performs an "exit 0", "exit 99", "exit 100", 
> "exit 137" or another exit status code. If I set my script to "exit 0", the 
> job exits normally. If I set it to "exit 99", the job gets requeued for 
> execution, and if I set it to "exit 100", the job goes into error state. All 
> of these scenarios are what I expect based on the man page for 
> "queue_conf". However, I am unable to take any other "exit ##", trap it and 
> force the job into error state by the method I describe.


This should work. In the `qacct` output you can even see a mixture of the real 
exit code and the job being rescheduled:

$ qacct -j 1083478
==============================================================
qname        parallel            
hostname     node17              
...
slots        4                   
failed       30  : rescheduling on application error
exit_status  56
...

A job exiting with 100 would of course show:

failed       30  : rescheduling on application error
exit_status  100
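
To pull just these two fields out of an accounting record, something along 
these lines may help. It is only a rough sketch: qacct_summary is a made-up 
helper name, and it assumes the field names appear in the first column exactly 
as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read `qacct -j <jobid>` output on stdin and print the
# numeric "failed" code and the "exit_status" on one line.
# If the input holds several records (e.g. array tasks), the last one wins.
qacct_summary() {
    awk '$1 == "failed"      { failed = $2 }
         $1 == "exit_status" { status = $2 }
         END { printf "failed=%s exit_status=%s\n", failed, status }'
}
```

Used as "qacct -j 1083478 | qacct_summary", the first record shown above would 
come out as "failed=30 exit_status=56".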


-- Reuti


> I'm not sure whether what I'm trying to do makes sense or whether I should 
> consider a different way of doing it. I can look at the "starter_method" to 
> see if it is a viable option.
> 
> Thanks in advance.
> 
> -----
> Wayne Lee
> 
> 
> -----Original Message-----
> From: William Hay [mailto:w....@ucl.ac.uk] 
> Sent: Wednesday, September 14, 2016 2:38 AM
> To: Lee, Wayne <w...@hess.com>
> Cc: users@gridengine.org Group <users@gridengine.org>
> Subject: Re: [gridengine users] Forcing Grid Engine jobs to error state with 
> exit status other than 0, 99 or 100.
> 
> On Tue, Sep 13, 2016 at 06:52:53PM +0000, Lee, Wayne wrote:
>>   In the epilog script that I've setup for our jobs, I've attempted to
>>   capture the value of the "exit_status" of a job or job task and if it
>>   isn't 0, 99 or 100, exit the epilog script with an "exit 100".   However
>>   this doesn't appear to work.  
> 
> In general when describing an issue or problem it is more helpful to describe 
> what does happen than what doesn't.  The number of things that didn't happen 
> when you made the epilog script exit 100 is almost infinite.
> 
>> 
>> 
>> 
>>   Another way of stating what I'm trying to convey: if the exit status of
>>   a job or job task is anything other than 0, 99 or 100, put the job in
>>   error state. If this can be done, then we would know that a job didn't
>>   complete correctly, and if it is in Eqw state we have the option of
>>   clearing the error state (i.e. qmod -cj) and re-executing the job.
> 
> One possibility would be to write a starter_method that wraps the real job 
> and does an exit 100 when the job terminates with an exit status other than 0 
> or 99. 
> 
> William 
> 
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
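
William's starter_method suggestion quoted above could be sketched roughly as 
follows. This is an illustration under assumptions, not a tested production 
wrapper: run_job is a made-up name, and it assumes the starter method receives 
the job command and its arguments as "$@".

```shell
#!/bin/sh
# Hypothetical wrapper for use inside a starter_method script.
# Runs the real job, passes exit statuses 0 and 99 through unchanged, and
# turns every other exit status into 100 (job error state).
run_job() {
    status=0
    "$@" || status=$?
    case "$status" in
        0|99) return "$status" ;;
        *)    return 100 ;;
    esac
}
```

The starter_method script itself would end with something like 
run_job "$@" followed by exit $?. Note that the job's real exit status is 
replaced by 100 here, so it would have to be logged separately if it is still 
needed.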


