On Fri, Jun 15, 2012 at 12:31 PM, Rayson Ho <[email protected]> wrote:
> On Fri, Jun 15, 2012 at 1:46 PM, Michael Coffman
> <[email protected]> wrote:
> > Also might be of interest:
>
> Thanks... Also, any messages in the execd "messages" file??
>

06/14/2012 08:56:49| main|cs431|E|shepherd of job 9990340.1 exited with exit status = 11

> Rayson
>
> >
> > ==============================================================
> > qname        all.q
> > hostname     cs431.ftc.avagotech.net
> > group        fidlib
> > owner        bgp
> > project      NONE
> > department   priority
> > jobname      qsubcmd.21231
> > jobnumber    17593
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Wed Dec 31 17:00:00 1969
> > start_time   -/-
> > end_time     -/-
> > granted_pe   NONE
> > slots        0
> > failed       11 : before job
> > exit_status  0
> > ru_wallclock 0
> > ru_utime     0.000
> > ru_stime     0.000
> > ru_maxrss    0
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    0
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   0
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     0
> > ru_nivcsw    0
> > cpu          0.000
> > mem          0.000
> > io           0.000
> > iow          0.000
> > maxvmem      0.000
> > arid         undefined
> >
> >
> > On Fri, Jun 15, 2012 at 11:27 AM, Michael Coffman
> > <[email protected]> wrote:
> >>
> >> On Fri, Jun 15, 2012 at 11:11 AM, Rayson Ho <[email protected]> wrote:
> >>>
> >>> Can you set "execd_params" to KEEP_ACTIVE for this host?? (See the
> >>> manpage at this URL:
> >>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html )
> >>>
> >>> Request the job to run in this queue/host again, and see why the
> >>> shepherd can't open the job_pid.
> >>>
> >>> (And remember to unset the execd_params or else you will fill up your
> >>> local spool dir eventually with job information.)
> >>>
> >>
> >> I can't do this on my production grid. And I don't know how to replicate
> >> the problem currently. I will set things up on a test setup and try and
> >> reproduce the issue with KEEP_ACTIVE turned on.
> >>
> >> Is it possible to set the KEEP_ACTIVE per host? I only see this in the
> >> qconf -sconf
> >>
> >>>
> >>> Rayson
> >>>
> >>>
> >>> On Fri, Jun 15, 2012 at 12:58 PM, Michael Coffman
> >>> <[email protected]> wrote:
> >>> > On Fri, Jun 15, 2012 at 10:11 AM, Rayson Ho <[email protected]>
> >>> > wrote:
> >>> >>
> >>> >> On Fri, Jun 15, 2012 at 12:01 PM, Michael Coffman
> >>> >> <[email protected]> wrote:
> >>> >> > From the qmaster messages file:
> >>> >> > 06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host
> >>> >> > cs428.ftc.avagotech.net general before job because: 06/14/2012
> >>> >> > 21:29:37 [20339:8436]: can't open file job_pid: Permission denied
> >>> >> >
> >>> >> > I checked a job_pid file on a currently running job on the system
> >>> >> > that had the above errors, permission down the entire tree seems
> >>> >> > fine and here is the job_pid file:
> >>> >> >
> >>> >> > -rw-r--r-- 1 grid grid 6 Jun 14 17:40 job_pid
> >>> >>
> >>> >> Is your execd spool dir on NFS or local??
> >>> >>
> >>> > Local.
> >>> >
> >>> >>
> >>> >> Also, does it happen to all nodes or just a node or queue?
> >>> >>
> >>> >
> >>> > Happened on 2 different nodes. Not all jobs caused this.
> >>> >
> >>> >>
> >>> >> Rayson
> >>> >>
> >>> >> >
> >>> >> > Any clues? Is the path perhaps hard coded into sge_shepherd for
> >>> >> > this file?
> >>> >> >
> >>> >> > Thanks.
> >>> >> > --
> >>> >> > -MichaelC
> >>> >
> >>> > --
> >>> > -MichaelC
> >>
> >> --
> >> -MichaelC
> >
> > --
> > -MichaelC
>

--
-MichaelC
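
P.S. In case it helps anyone searching their own "messages" files for the same thing, here is a minimal Python sketch that splits lines like the two quoted in this thread and picks out the warning/error entries that mention a shepherd or a failed job. The field names (timestamp, thread, host, level, text) are my own labels, assumed from the two sample lines above and not taken from any Grid Engine documentation; parse_message_line and failures are hypothetical helper names.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class MessageLine(NamedTuple):
    # Field names are assumptions based on the sample lines in this thread.
    timestamp: str  # e.g. "06/14/2012 08:56:49"
    thread: str     # e.g. "main" or "worker"
    host: str       # e.g. "cs431"
    level: str      # single letter, e.g. "E" or "W"
    text: str       # free-form message


def parse_message_line(line: str) -> Optional[MessageLine]:
    """Split one pipe-delimited line; return None if it does not fit."""
    # Split on the first four pipes only, so pipes in the message text survive.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return MessageLine(*(p.strip() for p in parts))


def failures(lines: Iterable[str]) -> Iterator[MessageLine]:
    """Yield E/W entries whose text mentions a shepherd or a failed job."""
    for line in lines:
        rec = parse_message_line(line)
        if rec is None or rec.level not in ("E", "W"):
            continue
        if "shepherd" in rec.text or "failed" in rec.text:
            yield rec


if __name__ == "__main__":
    sample = [
        "06/14/2012 08:56:49| main|cs431|E|shepherd of job 9990340.1 "
        "exited with exit status = 11",
        "06/14/2012 21:29:39|worker|gemaster|W|job 3885.1 failed on host "
        "cs428.ftc.avagotech.net general before job because: 06/14/2012 "
        "21:29:37 [20339:8436]: can't open file job_pid: Permission denied",
    ]
    for rec in failures(sample):
        print(rec.timestamp, rec.host, rec.level, rec.text)
```

To run it against a real file, pass an open file object to failures(); the path to the messages file depends on your spool layout, so I have left it out.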
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
