On 29.10.2012 at 17:12, Julien Nicoulaud wrote:

> For those interested, I worked around the issue by switching from qsub to
> qrsh; everything seems to work fine so far.

Thanks for sharing this info.

-- Reuti

> 2012/10/18 Reuti <[email protected]>
> On 21.09.2012 at 18:06, Julien Nicoulaud wrote:
>
> > Yes, still the same question, I'm trying to get a proper exit code for
> > "qsub -sync y" :)
> > When I talk about graceful shutdown, I only talk about the slaves. It
> > really seems to me that whatever happens, if the slave tasks are not
> > cleanly shut down, qsub will always show this "Unable to run job" message
> > and return 0.
>
> I could only think of a wrapper that scans for a file which is only
> written at the regular end of the job and sets its exit value accordingly.
>
> -- Reuti
>
>
> > 2012/9/21 Reuti <[email protected]>
> > On 21.09.2012 at 16:13, Julien Nicoulaud wrote:
> >
> > > I tried to implement the -notify + trap USR2 solution, but could not get
> > > it to work. I can trap the USR2 signal in the qmaster PE script, but as
> > > soon as it is sent, the slave tasks get killed, leaving my application no
> > > time to cleanly shut them down. The qmaster log displays:
> >
> > Is this a new question? Originally you wanted to get a proper exit code for
> > -sync y, now to gracefully shut down.
> >
> > -- Reuti
> >
> >
> > > tightly integrated parallel task 61969.1 task 1.computeXX failed -
> > > killing job
> > >
> > > The queue is configured with "notify 00:00:60", so that should leave at
> > > least one minute. I also tried to trap USR2 in the PE script and not
> > > forward it at all to child processes, but slave tasks still get killed. Is
> > > there something else specific to do to avoid this?
> > >
> > > 2012/9/19 Julien Nicoulaud <[email protected]>
> > > Yes, that's what I meant. For me, if control_slaves is FALSE, qsub
> > > returns with a non-zero exit code after h_rt has elapsed.
> > >
> > >
> > > 2012/9/19 Reuti <[email protected]>
> > > Hi,
> > >
> > > On 19.09.2012 at 14:36, Julien Nicoulaud wrote:
> > >
> > > > On SGE 6.2u5, I submit jobs with -sync y and h_rt.
> > > > When the job gets killed after the time has elapsed, qsub prints an
> > > > "Unable to run job" message but exits with code 0. I tried to trap the
> > > > KILL signal inside the job script, but it does not seem to affect the
> > > > qsub return code. Is it possible to make it return 1?
> > > >
> > > > Note: it only behaves this way for jobs running in a tightly integrated
> > > > parallel environment. In a loosely integrated PE, qsub returns 1 in
> > > > this case...
> > >
> > > You mean the setting of "control_slaves"? For me it's always 0 if I
> > > request a PE.
> > >
> > > -- Reuti

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
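
The wrapper idea mentioned in the thread (look for a file that is only written at the regular end of the job, and set the exit value from that) could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the function name run_with_marker and the marker path are hypothetical, and it assumes the job script creates the marker file as its very last step on a filesystem that the submit host can also see.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): run a submit command, then
# derive the exit status from a marker file instead of trusting the status
# reported by the command itself.
#
# usage: run_with_marker MARKER_FILE COMMAND [ARGS...]
# e.g.   run_with_marker "$HOME/myjob.done" qsub -sync y job.sh
#        (assumes job.sh ends with: touch "$HOME/myjob.done")
run_with_marker() {
    marker=$1
    shift

    # Drop any stale marker left over from a previous run.
    rm -f -- "$marker"

    # Run the submit command and remember its status without aborting
    # the caller if it fails.
    submit_status=0
    "$@" || submit_status=$?

    # The marker only exists if the job reached its regular end.
    if [ -f "$marker" ]; then
        return 0
    fi

    # No marker: report failure even if the command itself returned 0.
    if [ "$submit_status" -ne 0 ]; then
        return "$submit_status"
    fi
    return 1
}
```

With this, a job killed at h_rt never reaches its final touch, so the wrapper returns non-zero even when "qsub -sync y" itself returns 0. A job that fails in some other way before its last line is also reported as failed, which may or may not be what is wanted.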
