Re: [OMPI users] signal handling

Reuti Mon, 12 Mar 2007 15:38:38 -0400

Am 12.03.2007 um 20:22 schrieb Pak Lui:

Hi Mark,
Olesen, Mark wrote:
I'm testing openmpi 1.2rc1 with GridEngine 6.0u9 and ran into interesting
behaviour when using the qsub -notify option.
With -notify, USR1 and USR2 are sent X seconds before sending STOP and KILL
signals, respectively.
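
As a rough illustration of the -notify behaviour described above, the following Python sketch installs handlers for USR1 and USR2 so that a job can react to the advance warning before STOP or KILL arrives. It is only a sketch under assumptions: the names received, on_notify and simulate_notify are made up for this example, and the signals are sent by the process to itself rather than by Grid Engine.

```python
import os
import signal

received = []  # signal numbers seen so far, in arrival order


def on_notify(signum, frame):
    # Only record the signal here; real cleanup would run from the main loop.
    received.append(signum)


# Assumed wiring for this example: USR1 warns of an upcoming STOP,
# USR2 warns of an upcoming KILL.
signal.signal(signal.SIGUSR1, on_notify)
signal.signal(signal.SIGUSR2, on_notify)


def simulate_notify():
    """Send both warning signals to this process, standing in for the scheduler."""
    os.kill(os.getpid(), signal.SIGUSR1)
    os.kill(os.getpid(), signal.SIGUSR2)
    return list(received)
```

Calling simulate_notify() returns the signal numbers that the handlers recorded. A real job would use the interval between the warning and the STOP or KILL to checkpoint or shut down cleanly.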
When the USR2 signal is sent to the process group with the mpirun process, I
receive an error message about not being able to start a daemon:
mpirun: Forwarding signal 12 to job[dealc12:18212] ERROR: A daemon on node
dealc12 failed to start as expected.
[dealc12:18212] ERROR: There may be more information available from
[dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[dealc12:18212] ERROR: If the problem persists, please restart the
[dealc12:18212] ERROR: Grid Engine PE job
[dealc12:18212] The daemon received a signal 12.
[dealc12:18212] ERROR: A daemon on node dealc20 failed to start as expected.
[dealc12:18212] ERROR: There may be more information available from
[dealc12:18212] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[dealc12:18212] ERROR: If the problem persists, please restart the
[dealc12:18212] ERROR: Grid Engine PE job
[dealc12:18212] The daemon received a signal 12.
Hmm, this is interesting. I just tried using -notify on my sample
batch job but I wasn't able to reproduce this on Solaris. I tried sending a
USR2 signal, and that kills all of the child processes and mpirun exits.

Isn't it the case that on Solaris the additional group ID is used to determine the processes to signal, whereas the default on Linux is the process group (unless otherwise configured in the SGE config)?


-- Reuti
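
To make the process-group delivery mentioned above concrete, here is a small Python sketch that starts a child in its own process group and signals the whole group with USR2. It is an illustration only and does not reproduce what SGE does internally: the function name signal_group_demo and the use of sleep as a stand-in job are assumptions for this example.

```python
import os
import signal
import subprocess


def signal_group_demo():
    """Start a child in its own process group, signal the group, return its exit status."""
    # start_new_session=True makes the child a session and process-group leader,
    # so its process group ID equals its PID.
    child = subprocess.Popen(["sleep", "30"], start_new_session=True)
    # killpg delivers the signal to every process in that group.
    os.killpg(child.pid, signal.SIGUSR2)
    # A negative return code -N means the child was terminated by signal N.
    return child.wait()
```

Because sleep does not handle USR2, the default action terminates it, which matches the "exited on signal" report seen when mpirun forwards the signal to ranks that have no handler installed.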

Usually when the child processes aren't started up properly, there is a
high chance that the qrsh or orte daemon is not started.

% qsub -notify start-qsub.tcsh
Your job 1277 ("job") has been submitted
% kill -USR2 `pgrep orterun`
% more job.*1278
::::::::::::::
job.e1278
::::::::::::::
mpirun: Forwarding signal 17 to jobmpirun noticed that job rank 0 with
PID 12562 on node burl-ct-v440-5 exited on signal 17 (User Signal 2).
::::::::::::::
job.o1278
::::::::::::::
Warning: no access to tty (Bad file number).
Thus no job control in this shell.
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005
2 additional processes aborted (not shown)
The job eventually stops, but the mpirun process itself continues to live
(just the ppid changes).
According to orte(1)/Signal Propagation, USR1 and USR2 should be propagated
to all processes in the job (which seems to be happening), but why is a
daemon start being attempted and the mpirun not being stopped?
Is there a way you can print the stack on the mpirun, to see if it's
waiting for something?
/mark
------------------------------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--

- Pak Lui
pak....@sun.com
