For those interested, I worked around the issue by switching from qsub to
qrsh; everything seems to work fine so far.

2012/10/18 Reuti <[email protected]>

> On 21.09.2012 at 18:06, Julien Nicoulaud wrote:
>
> > Yes, still the same question, I'm trying to get a proper exit code for
> "qsub -sync y" :)
> > When I talk about graceful shutdown, I only talk about the slaves. It
> really seems to me that whatever happens, if the slave tasks are not
> cleanly shut down, qsub will always show this "Unable to run job" message
> and return 0.
>
> I could only think of a wrapper, which will scan for a file which is only
> written at the regular end of the job and set its exit value accordingly.
>
> -- Reuti
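A minimal sketch of the wrapper idea described above, not a tested SGE recipe. The helper name run_with_marker and the marker path are assumptions for illustration; the job script itself would have to create the marker as its very last step (for example with touch), on a filesystem that the submit host can see.

```shell
#!/bin/sh
# Hypothetical wrapper: run a submit command, ignore its own exit status,
# and report success only if the job left its "regular end" marker behind.

# Usage: run_with_marker MARKER_FILE COMMAND [ARGS...]
run_with_marker() {
    marker=$1
    shift
    rm -f "$marker"        # drop any stale marker from an earlier run
    "$@" || true           # e.g. qsub -sync y job.sh; status is not trusted
    if [ -f "$marker" ]; then
        return 0           # marker present: the job reached its regular end
    fi
    return 1               # marker missing: treat the job as failed
}

# Example call (job.sh and the marker path are placeholders):
# run_with_marker "$HOME/job.done" qsub -sync y job.sh
```

With this, the caller looks at the wrapper's return value instead of the one from "qsub -sync y", so a job killed before writing the marker shows up as a failure.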
>
>
> > 2012/9/21 Reuti <[email protected]>
> > On 21.09.2012 at 16:13, Julien Nicoulaud wrote:
> >
> > > I tried to implement the -notify + trap USR2 solution, but could not
> get it to work. I can trap the USR2 signal in the qmaster PE script, but as
> soon as it is sent, the slave tasks get killed, leaving my application no
> time to cleanly shut them down. The qmaster log displays:
> >
> > Is this a new question? Originally you wanted to get a proper exit code
> for -sync y, now to gracefully shut down.
> >
> > -- Reuti
> >
> >
> > > tightly integrated parallel task 61969.1 task 1.computeXX failed -
> killing job
> > >
> > > The queue is configured with "notify 00:00:60", so that should leave
> at least one minute. I also tried to trap USR2 in the PE script and not
> forward it at all to child processes, but slave tasks still get killed. Is
> there something else specific to do to avoid this?
> > >
> > > 2012/9/19 Julien Nicoulaud <[email protected]>
> > > Yes, that's what I meant. For me, if control_slaves is FALSE, qsub
> returns with a non-zero exit code after h_rt has elapsed.
> > >
> > >
> > > 2012/9/19 Reuti <[email protected]>
> > > Hi,
> > >
> > > On 19.09.2012 at 14:36, Julien Nicoulaud wrote:
> > >
> > > > On SGE 6.2u5, I submit jobs with -sync y and h_rt. When the job
> gets killed after the time has elapsed, qsub prints an "Unable to run job"
> message but exits with code 0.  I tried to trap the KILL signal
> > > > inside the job script, but it does not seem to affect the qsub return
> code. Is it possible to make it return 1?
> > > >
> > > > Note: it only behaves this way for jobs running in a tightly
> integrated parallel environment. In a loosely integrated PE, qsub returns 1
> in this case...
> > >
> > > You mean the setting of "control_slaves"? For me it's always 0 if I
> request a PE.
> > >
> > > -- Reuti
> > >
> > >
> >
> >
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
