On 12.03.2007 at 20:36, Ralph Castain wrote:
ORTE propagates the signal to the application processes, but the ORTE daemons never actually look at the signal themselves (it looks just like a message to them). So I'm a little puzzled by that error message about the "daemon received signal 12" - I suspect that's just a misleading message that was supposed to indicate that a daemon was given a signal to pass on.
Just to clarify: the daemons are moved out of your initial process group to
Is this still the case in SGE mode as well? It was the reason why I never wrote a Howto for a Tight Integration under SGE. Instead I was waiting for the final 1.2 with full SGE support.
And: this might be odd under SGE. I must admit that I haven't had the time yet to play with OpenMPI 1.2-beta for the Tight Integration, but it sounds to me like (under Linux) the orte daemons could survive although the job was already killed (by process group), as the final stop/kill can't be caught and forwarded.
I'll check this ASAP with 1.2-beta. I only have access to Linux clusters.
But now we are going beyond Mark's initial problem.
-- Reuti
avoid seeing any signals from your terminal. When you issue a signal, mpirun picks it up and forwards it to your application processes via the ORTE daemons - the ORTE daemons, however, do *not* look at that signal but just pass it along.
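Roughly, the forwarding works like this (only a sketch of the idea, not the actual ORTE source - all names here are made up): the daemon gets a message naming the signal and relays it with kill() to the local application pids, without ever handling the signal itself.

/* Rough sketch of the forwarding idea only (not the actual ORTE source,
 * all names are made up): the "daemon" relays the signal named in the
 * message to its local children and never handles it itself. */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_LOCAL 2

static pid_t local_pids[NUM_LOCAL];    /* stand-ins for local app processes */

static void relay_signal_message(int signum)
{
    for (int i = 0; i < NUM_LOCAL; i++)
        kill(local_pids[i], signum);   /* pass it along, nothing more */
}

int main(void)
{
    for (int i = 0; i < NUM_LOCAL; i++) {
        if ((local_pids[i] = fork()) == 0) {
            pause();                   /* child waits for a signal */
            _exit(0);
        }
    }

    sleep(1);                          /* let the children reach pause() */
    relay_signal_message(SIGUSR2);     /* "message" says: deliver signal 12 */

    for (int i = 0; i < NUM_LOCAL; i++)
        waitpid(local_pids[i], NULL, 0);
    printf("children received the relayed signal\n");
    return 0;
}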
As for timing, all we do is pass STOP to the OpenMPI application process - it's up to the local system as to what happens when a "kill -STOP" is issued. It was always my impression that the system stopped process execution immediately under that signal, but with some allowance for the old kernel vs user space issue.
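If you want to check the local behaviour, a quick standalone test (just my own sketch, nothing ORTE-specific) is to fork a busy child, send it SIGSTOP, and confirm with waitpid() that the kernel stops it right away:

/* Sketch: fork a spinning child, send SIGSTOP, and use
 * waitpid(..., WUNTRACED) to observe that it is stopped immediately. */
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t child = fork();
    if (child == 0) {
        for (;;)                /* busy loop standing in for an MPI rank */
            ;
    }

    sleep(1);                   /* let the child start spinning */
    kill(child, SIGSTOP);

    int status;
    waitpid(child, &status, WUNTRACED);
    if (WIFSTOPPED(status))
        printf("child %d stopped by signal %d\n", (int)child, WSTOPSIG(status));

    kill(child, SIGKILL);       /* clean up */
    waitpid(child, &status, 0);
    return 0;
}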
Once all the processes have terminated, mpirun tells the daemons to go ahead and exit. That's the only way the daemons get terminated in this procedure.
Can you tell us something about your system? Is this running under Linux, what kind of OS, how was OpenMPI configured, etc.?
Thanks
Ralph
On 3/12/07 1:26 PM, "Reuti" <re...@staff.uni-marburg.de> wrote:
On 12.03.2007 at 19:55, Ralph Castain wrote:
I'll have to look into it - I suspect this is simply an erroneous message and that no daemon is actually being started.
I'm not entirely sure I understand what's happening in your code, though. Are you saying that mpirun starts some number of application processes which run merrily along, and then qsub sends out USR1/2 signals followed by STOP and then KILL in an effort to abort the job? So the application processes don't normally terminate, but instead are killed via these signals?
If you specify -notify in SGE with the qsub, then jobs are warned by the sge_shepherd (parent of the job) during execution, so that they can perform some proper shutdown action before they are really stopped/killed:
for suspend: USR1 -wait-defined-time- STOP
for kill: USR2 -wait-defined-time- KILL
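A job that wants to react to the warning could install handlers along these lines (just a rough sketch of my own, assuming the default setup where USR1/USR2 are the notification signals):

/* Sketch of a job reacting to SGE's -notify warning signals: the
 * handler only sets a flag, and the main loop shuts down cleanly
 * before the real STOP/KILL arrives. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr1 = 0;   /* suspend is coming */
static volatile sig_atomic_t got_usr2 = 0;   /* kill is coming */

static void notify_handler(int sig)
{
    if (sig == SIGUSR1) got_usr1 = 1;
    if (sig == SIGUSR2) got_usr2 = 1;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = notify_handler;
    sigaction(SIGUSR1, &sa, NULL);
    sigaction(SIGUSR2, &sa, NULL);

    for (;;) {
        /* ... real work would go here ... */
        if (got_usr2) {
            /* write a checkpoint / close files before KILL arrives */
            fprintf(stderr, "USR2 received, shutting down cleanly\n");
            return 0;
        }
        sleep(1);
    }
}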
Worth noting: the signals are sent to the complete process group of the job created by the jobscript and mpirun, but not to each daemon which is created by the internal qrsh on any of the slave nodes! This should be orte's duty.
Another question is: do OpenMPI jobs survive a STOP for some time at all, or will there be timing issues due to communication timeouts?
HTH - Reuti
Just want to ensure I understand the scenario here, as that is something obviously unique to GE.
Thanks
Ralph
On 3/12/07 9:42 AM, "Olesen, Mark" <mark.ole...@arvinmeritor.com> wrote:
I'm testing openmpi 1.2rc1 with GridEngine 6.0u9 and ran into interesting behaviour when using the qsub -notify option. With -notify, USR1 and USR2 are sent X seconds before sending the STOP and KILL signals, respectively.
When the USR2 signal is sent to the process group with the mpirun process, I receive an error message about not being able to start a daemon:
mpirun: Forwarding signal 12 to job
[dealc12:18212] ERROR: A daemon on node dealc12 failed to start as expected.
[dealc12:18212] ERROR: There may be more information available from
[dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[dealc12:18212] ERROR: If the problem persists, please restart the
[dealc12:18212] ERROR: Grid Engine PE job
[dealc12:18212] The daemon received a signal 12.
[dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
[dealc12:18212] ERROR: There may be more information available from
[dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[dealc12:18212] ERROR: If the problem persists, please restart the
[dealc12:18212] ERROR: Grid Engine PE job
[dealc12:18212] The daemon received a signal 12.
The job eventually stops, but the mpirun process itself continues to live (just the ppid changes).
According to orte(1)/Signal Propagation, USR1 and USR2 should be propagated to all processes in the job (which seems to be happening), but why is a daemon start being attempted and why is the mpirun not being stopped?
/mark
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users