On 13.03.2007, at 06:01, Ralph Castain wrote:
I've been letting this rattle around in my head some more, and *may* have come up with an idea of what *might* be going on.

In the GE environment, qsub only launches the daemons - the daemons are the ones that actually "launch" your local application processes. If qsub -notify uses qsub's knowledge of the processes being executed, then it *might* be tempted to send the USR1/2 signals directly to the daemons as well as to mpirun.
Only the process group containing (jobscript + mpirun + kids) on the head node of the parallel job should get it, just as with SIGSTOP. Otherwise, suspension of parallel jobs would already be built into SGE.
In that case, it might be that our daemon's call to separate from the process group isn't adequate to break that qsub connection - we may be separating from the Linux/Solaris process group, but not from qsub's list of executing processes.
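For illustration, here is a minimal sketch of that kind of separation - a plain fork() plus setsid(), not the actual ORTE daemon code. It detaches the child from the OS-level process group, but it does nothing about whatever bookkeeping Grid Engine keeps of the job's processes:

/* Minimal sketch, not the actual ORTE daemon code: fork a child that
 * calls setsid() so it leaves the process group it inherited.  Signals
 * sent to the old group no longer reach it, but Grid Engine can still
 * track it by other means (e.g. the additional group ID it assigns). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }
    if (pid > 0)
        return EXIT_SUCCESS;          /* parent exits, child lives on */

    if (setsid() < 0) {               /* new session, new process group */
        perror("setsid");
        return EXIT_FAILURE;
    }

    printf("daemon: pid=%d pgid=%d\n", (int)getpid(), (int)getpgrp());
    pause();                          /* idle here until a signal arrives */
    return EXIT_SUCCESS;
}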
IF that is true, then this could cause some strange behavior. I honestly have no idea what a USR1/2 signal hitting the daemon would do - we don't try to trap that signal in the daemon, so it likely would be ignored.

The default for USR1/2 is to terminate the process, AFAIK.
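That is easy to verify with a few lines of C (a stand-alone test, nothing Open MPI specific): with no handler installed, a USR2 delivered to the process terminates it; it only survives if it explicitly handles or ignores the signal.

/* Tiny test of the default disposition of SIGUSR2: run it with no
 * argument and "kill -USR2 <pid>" terminates it; run it with any
 * argument and the installed handler keeps it alive. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_usr2(int sig)
{
    const char msg[] = "caught SIGUSR2, still alive\n";
    (void)sig;
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);   /* async-signal-safe */
}

int main(int argc, char **argv)
{
    if (argc > 1) {                   /* e.g. "./sigtest trap" installs a handler */
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_usr2;
        sigaction(SIGUSR2, &sa, NULL);
    }
    printf("pid %d waiting for SIGUSR2\n", (int)getpid());
    fflush(stdout);
    for (;;)
        pause();
}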
However, it is possible that something unusual could occur (though why it would try to spawn another daemon is beyond me).

I can assure you, though, that the daemon really won't like getting a STOP or KILL sent directly to it - this definitely would cause shutdown issues with respect to cleanup, and could possibly cause mpirun and/or your application to "hang".

They get a KILL for sure, but no STOP.

Do you have access to an SGE cluster?

-- Reuti

Again, we don't trap those signals in the daemon (only in mpirun itself). When mpirun traps them, it sends an "abort" message to the daemons so they can cleanly exit (terminating their local procs along the way), thus bringing the system down in an orderly fashion.
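Roughly, the pattern is the one sketched below - the function names are made up for illustration and are not the real ORTE interfaces. The handler in mpirun only records that the signal arrived, and the normal event loop then tells each daemon to terminate its local procs and exit:

/* Sketch of the forwarding pattern described above (illustrative names,
 * not the real ORTE API): the handler just sets a flag, and the event
 * loop does the actual shutdown work. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t abort_requested = 0;

static void forward_handler(int sig)
{
    (void)sig;
    abort_requested = 1;      /* do the real work outside the handler */
}

/* hypothetical stand-in for "send an abort message to each daemon" */
static void notify_daemons_abort(void)
{
    puts("telling daemons to kill local procs and exit");
}

int main(void)
{
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = forward_handler;
    sigaction(SIGUSR1, &sa, NULL);
    sigaction(SIGUSR2, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);

    while (!abort_requested)
        pause();              /* stand-in for mpirun's event loop */

    notify_daemons_abort();   /* orderly shutdown, as described above */
    return 0;
}

Doing the real work outside the handler keeps anything that isn't async-signal-safe out of signal context.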
Again, IF this is happening, then it could be that the application processes are getting signals from two sources: (a) as part of the daemon's local process group on the node (since the daemon fork/exec's the local procs), and (b) propagated via the daemons by comm from mpirun. This could cause some interesting race conditions.
Anyway, I think someone more familiar with the peculiarities of qsub -notify will have to step in here. If my explanation is correct, then we likely have a problem that needs to be addressed for the GE environment. Otherwise, there may be something else at work here.
Ralph
On 3/12/07 9:42 AM, "Olesen, Mark" <mark.ole...@arvinmeritor.com> wrote:
I'm testing openmpi 1.2rc1 with GridEngine 6.0u9 and ran into interesting behaviour when using the qsub -notify option. With -notify, USR1 and USR2 are sent X seconds before the STOP and KILL signals, respectively.
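In other words, -notify gives the job a grace window: a process that traps USR2 has a few seconds to flush or checkpoint before the KILL arrives. A minimal stand-alone example of that (not Open MPI specific) would look like:

/* Minimal sketch of what "qsub -notify" offers an application: SIGUSR2
 * arrives some seconds before SIGKILL, giving the process a window to
 * flush or checkpoint.  Illustrative only. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t notified = 0;

static void on_notify(int sig)
{
    (void)sig;
    notified = 1;
}

int main(void)
{
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = on_notify;
    sigaction(SIGUSR2, &sa, NULL);   /* the warning that precedes SIGKILL */

    while (!notified)
        sleep(1);                    /* stand-in for real work */

    puts("got the -notify warning: flushing results before SIGKILL");
    return 0;
}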
When the USR2 signal is sent to the process group containing the mpirun process, I receive an error message about not being able to start a daemon:
mpirun: Forwarding signal 12 to job
[dealc12:18212] ERROR: A daemon on node dealc12 failed to start as expected.
[dealc12:18212] ERROR: There may be more information available from
[dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[dealc12:18212] ERROR: If the problem persists, please restart the
[dealc12:18212] ERROR: Grid Engine PE job
[dealc12:18212] The daemon received a signal 12.
[dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
[dealc12:18212] ERROR: There may be more information available from
[dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[dealc12:18212] ERROR: If the problem persists, please restart the
[dealc12:18212] ERROR: Grid Engine PE job
[dealc12:18212] The daemon received a signal 12.
The job eventually stops, but the mpirun process itself continues to live (just the ppid changes).
According to orte(1)/Signal Propagation, USR1 and USR2 should be propagated to all processes in the job (which seems to be happening), but why is a daemon start being attempted and the mpirun not being stopped?
/mark