ORTE propagates the signal to the application processes, but the ORTE daemons never actually look at the signal themselves (it looks just like any other message to them). So I'm a little puzzled by that error message about the "daemon received signal 12" - I suspect that's just a misleading message that was supposed to indicate that a daemon was given a signal to pass on.
Just to clarify: the daemons are moved out of your initial process group to avoid seeing any signals from your terminal. When you issue a signal, mpirun picks it up and forwards it to your application processes via the ORTE daemons - the ORTE daemons, however, do *not* look at that signal but just pass it along.

As for timing, all we do is pass STOP to the OpenMPI application process - it's up to the local system as to what happens when a "kill -STOP" is issued. It was always my impression that the system stopped process execution immediately under that signal, but with some allowance for the old kernel vs user space issue.

Once all the processes have terminated, mpirun tells the daemons to go ahead and exit. That's the only way the daemons get terminated in this procedure.

Can you tell us something about your system? Is this running under Linux or some other OS, how was OpenMPI configured, etc.?

Thanks
Ralph

On 3/12/07 1:26 PM, "Reuti" <re...@staff.uni-marburg.de> wrote:

> On 12.03.2007 at 19:55, Ralph Castain wrote:
>
>> I'll have to look into it - I suspect this is simply an erroneous message
>> and that no daemon is actually being started.
>>
>> I'm not entirely sure I understand what's happening, though, in your code.
>> Are you saying that mpirun starts some number of application processes which
>> run merrily along, and then qsub sends out USR1/2 signals followed by STOP
>> and then KILL in an effort to abort the job? So the application processes
>> don't normally terminate, but instead are killed via these signals?
>
> If you specify -notify in SGE with the qsub, then jobs are warned by
> the sge_shepherd (parent of the job) during execution, so that they
> can perform some proper shutdown action before they are really
> stopped/killed:
>
> for suspend: USR1 -wait-defined-time- STOP
> for kill:    USR2 -wait-defined-time- KILL
>
> Worth noting: the signals are sent to the complete process group
> of the job created by the jobscript and mpirun, but not to each
> daemon which is created by the internal qrsh on any of the slave
> nodes! This should be orte's duty.
>
> The question is also: do OpenMPI jobs survive a STOP for some time at
> all, or will there be timing issues due to communication timeouts?
>
> HTH - Reuti
>
>>
>> Just want to ensure I understand the scenario here as that is something
>> obviously unique to GE.
>>
>> Thanks
>> Ralph
>>
>>
>> On 3/12/07 9:42 AM, "Olesen, Mark" <mark.ole...@arvinmeritor.com> wrote:
>>
>>> I'm testing openmpi 1.2rc1 with GridEngine 6.0u9 and ran into interesting
>>> behaviour when using the qsub -notify option.
>>> With -notify, USR1 and USR2 are sent X seconds before sending STOP and KILL
>>> signals, respectively.
>>>
>>> When the USR2 signal is sent to the process group with the mpirun process, I
>>> receive an error message about not being able to start a daemon:
>>>
>>> mpirun: Forwarding signal 12 to job[dealc12:18212] ERROR: A daemon on node dealc12 failed to start as expected.
>>> [dealc12:18212] ERROR: There may be more information available from
>>> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>> [dealc12:18212] ERROR: If the problem persists, please restart the
>>> [dealc12:18212] ERROR: Grid Engine PE job
>>> [dealc12:18212] The daemon received a signal 12.
>>> [dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
>>> [dealc12:18212] ERROR: There may be more information available from
>>> [dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>>> [dealc12:18212] ERROR: If the problem persists, please restart the
>>> [dealc12:18212] ERROR: Grid Engine PE job
>>> [dealc12:18212] The daemon received a signal 12.
>>>
>>> The job eventually stops, but the mpirun process itself continues to live
>>> (just the ppid changes).
>>>
>>> According to orte(1)/Signal Propagation, USR1 and USR2 should be propagated
>>> to all processes in the job (which seems to be happening), but why is a
>>> daemon start being attempted and the mpirun not being stopped?
>>>
>>> /mark
>>>
>>> This e-mail message and any attachments may contain legally privileged,
>>> confidential or proprietary Information, or information otherwise protected by
>>> law of ArvinMeritor, Inc., its affiliates, or third parties. This notice
>>> serves as marking of its "Confidential" status as defined in any
>>> confidentiality agreements concerning the sender and recipient. If you are not
>>> the intended recipient(s), or the employee or agent responsible for delivery
>>> of this message to the intended recipient(s), you are hereby notified that any
>>> dissemination, distribution or copying of this e-mail message is strictly
>>> prohibited. If you have received this message in error, please immediately
>>> notify the sender and delete this e-mail message from your computer.
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
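
For the -notify sequence described in the thread (USR1 some time before STOP, USR2 some time before KILL), an application process could react to the warning signal along these lines. This is only a sketch under assumptions: `NotifyFlags`, `install_notify_handlers` and `run_until_warned` are hypothetical names, and the "work" is a plain callback rather than real MPI code.

```python
import os
import signal


class NotifyFlags:
    """Records which warning signals have arrived so far."""

    def __init__(self):
        self.suspend_pending = False  # USR1 seen: STOP will follow
        self.kill_pending = False     # USR2 seen: KILL will follow


def install_notify_handlers(flags):
    """Install USR1/USR2 handlers that only set flags.

    The handlers stay tiny; the real shutdown work happens in the main loop.
    """

    def on_usr1(signum, frame):
        flags.suspend_pending = True

    def on_usr2(signum, frame):
        flags.kill_pending = True

    signal.signal(signal.SIGUSR1, on_usr1)
    signal.signal(signal.SIGUSR2, on_usr2)


def run_until_warned(flags, step, max_steps=1000):
    """Run step(i) repeatedly and stop early once USR2 has been seen.

    Returns the index at which the loop stopped, so the caller can
    checkpoint or clean up before the scheduler sends KILL.
    """
    for i in range(max_steps):
        if flags.kill_pending:
            return i
        step(i)
    return max_steps


if __name__ == "__main__":
    flags = NotifyFlags()
    install_notify_handlers(flags)

    def step(i):
        # Simulate the scheduler's warning arriving during step 2.
        if i == 2:
            os.kill(os.getpid(), signal.SIGUSR2)

    stopped_at = run_until_warned(flags, step, max_steps=10)
    print("stopped before step", stopped_at)
```

Setting a flag in the handler and checking it between work steps keeps the shutdown logic in the main loop, where it can finish the current step before cleaning up.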