Hi William,

Thanks for the prompt reply. Apologies for not including more detail in my
query about getting Grid Engine to force all jobs with an exit status other
than 0, 99 or 100 into error state (i.e. the state a job enters on exit code
100).

As I stated in my earlier post, our jobs run an epilog script named
"gp_epilog" once the job finishes on a given execution host. The "gp_epilog"
script essentially does the following:

1.  Obtains the "exit_status" value from a file named "usage" in the job's
spool directory on the execution host. As an example, see the directory
listing below from a test job on an execution host named "g00801", where the
execution host's spool directory is /tmp/ge/. The listing includes the "usage"
file, and its contents are shown below the listing. The "exit_status" in this
example is 137.

Directory listing of /tmp/ge/g00801/active_jobs/1012.1
============================================================
/tmp/ugev841/g00801/active_jobs/1012.1:
total 48
drwxr-xr-x 2 sgeadmin adm    4096 Sep 13 13:12 .
drwxr-xr-x 3 sgeadmin adm    4096 Sep 13 13:12 ..
-rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 addgrpid
-rw-r--r-- 1 sgeadmin adm    2236 Sep 13 13:12 config
-rw-r--r-- 1 sgeadmin adm    1546 Sep 13 13:12 environment
-rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 error
-rw-r--r-- 1 tdhf781  hougeo    0 Sep 13 13:12 exit_status
prw-r--r-- 1 sgeadmin adm       0 Sep 13 13:12 fifo_execd_to_shepherd
-rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 job_pid
-rw-r--r-- 1 sgeadmin adm      54 Sep 13 13:12 pe_hostfile
-rw-r--r-- 1 sgeadmin adm       6 Sep 13 13:12 pid
-rw-r--r-- 1 tdhf781  hougeo 9095 Sep 13 13:12 trace
-rw-r--r-- 1 sgeadmin adm     324 Sep 13 13:12 usage


Contents of the "usage" file
============================
wait_status=2193
exit_status=137
signal=0
start_time=1473790362804
end_time=1473790367828
ru_wallclock=5.024000
ru_utime=0.004999
ru_stime=0.001999
ru_maxrss=1828
ru_ixrss=0
ru_idrss=0
ru_isrss=0
ru_minflt=3460
ru_majflt=0
ru_nswap=0
ru_inblock=8
ru_oublock=96
ru_msgsnd=0
ru_msgrcv=0
ru_nsignals=0
ru_nvcsw=73
ru_nivcsw=11


2. Once the value of "exit_status" is parsed from the "usage" file, the
"gp_epilog" script checks whether it is something other than 0, 99 or 100. If
so, the "gp_epilog" script executes an "exit 100". I'm assuming the
"exit_status" value in the "usage" file is the exit status of the application
run by the job/job task on the execution host (g00801 in the example above).
My thinking was that if I issue an "exit 100" from within the "gp_epilog"
script, the job/job task would go into error state, and that I would see it in
the "qstat" output with a state of "Eqw" or something similar.
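
For illustration, the logic in steps 1 and 2 can be sketched roughly as
follows. This is a minimal sketch, not the actual "gp_epilog" script: the
function names are hypothetical, it assumes the SGE_JOB_SPOOL_DIR environment
variable is available to the epilog and points at the directory holding the
"usage" file, and it assumes the epilog should exit 0 when the job's status is
0, 99 or 100.

```shell
#!/bin/sh
# Sketch of an epilog that turns unexpected job exit codes into "exit 100".

# Print the exit_status value recorded in a "usage" file.
# $1: path to the usage file
read_exit_status() {
    sed -n 's/^exit_status=//p' "$1" | head -n 1
}

# Print the exit code the epilog should return for a given job exit status.
# Assumption: 0, 99 and 100 are left alone (epilog exits 0); anything else,
# including a missing value, is treated as a failure and mapped to 100.
epilog_exit_code() {
    case "$1" in
        0|99|100) echo 0 ;;
        *)        echo 100 ;;
    esac
}

# Only act when the spool directory is known and the usage file is readable.
# Assumption: SGE_JOB_SPOOL_DIR is set in the epilog's environment.
if [ -n "${SGE_JOB_SPOOL_DIR:-}" ] && [ -r "$SGE_JOB_SPOOL_DIR/usage" ]; then
    status=$(read_exit_status "$SGE_JOB_SPOOL_DIR/usage")
    exit "$(epilog_exit_code "$status")"
fi
```

With the "usage" file shown above, read_exit_status would print 137 and the
epilog would therefore exit with 100.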

I've performed some tests by submitting a basic shell script which dumps the
environment (i.e. env) and then performs an "exit 0", "exit 99", "exit 100",
"exit 137" or some other exit status code. If I set my script to "exit 0", the
job exits normally. If I set it to "exit 99", the job gets requeued for
execution, and if I set it to "exit 100", the job goes into error state. All
of these outcomes are what I expect based on the man page for "queue_conf".
However, with any other "exit ##" I am unable to trap it and force the job
into error state by the method described above.

I'm not sure whether what I'm trying to do makes sense or whether I should
consider a different approach. I can look at the "starter_method" to see if
that is a viable way.

Thanks in advance.

-----
Wayne Lee


-----Original Message-----
From: William Hay [mailto:w....@ucl.ac.uk] 
Sent: Wednesday, September 14, 2016 2:38 AM
To: Lee, Wayne <w...@hess.com>
Cc: users@gridengine.org Group <users@gridengine.org>
Subject: Re: [gridengine users] Forcing Grid Engine jobs to error state with 
exit status other than 0, 99 or 100.

On Tue, Sep 13, 2016 at 06:52:53PM +0000, Lee, Wayne wrote:
>    In the epilog script that I've setup for our jobs, I've attempted to
>    capture the value of the "exit_status" of a job or job task and if it
>    isn't 0, 99 or 100, exit the epilog script with an "exit 100".   However
>    this doesn't appear to work.  

In general when describing an issue or problem it is more helpful to describe 
what does happen than what doesn't.  The number of things that didn't happen 
when you made the epilog script exit 100 is almost infinite.

> 
>     
> 
>    Anyway way of stating what I'm trying to convey is if the exit status a
>    job or job task is anything other than 0, 99 or 100 put the job in error
>    state.      If this can be done, then we would know that a job didn't
>    complete correctly and if it is in Eqw state we have the option of
>    clearing error state (i.e. qmod -cj) and re-executing the job again.

One possibility would be to write a starter_method that wraps the real job and 
does an exit 100 when the job terminates with an exit status other than 0 or 
99. 
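
A minimal sketch of such a wrapper might look like the following. It is an
illustration under assumptions, not a tested configuration: the function name
is hypothetical, and it assumes the job command that Grid Engine passes to the
starter method as arguments can be executed directly.

```shell
#!/bin/sh
# Sketch of a starter_method wrapper that forces error state on failure.

# Print the exit status to report to Grid Engine for a given job status:
# 0 (success) and 99 (requeue) pass through, everything else becomes 100.
map_exit_status() {
    case "$1" in
        0|99) echo "$1" ;;
        *)    echo 100 ;;
    esac
}

# The job and its arguments arrive as the wrapper's own arguments.
# Assumption: "$@" is directly executable; with no arguments, do nothing.
if [ "$#" -gt 0 ]; then
    "$@"
    exit "$(map_exit_status "$?")"
fi
```

The wrapper would be referenced from the starter_method setting of the queue
configuration. A job that ends with status 137, as in the earlier example,
would then be reported as 100.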

William 

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
