I gave the version in a previous P.S.: Open MPI 1.4.3, from the Ubuntu 11.10 repositories. :-)
Thanks for the clarifications!

2012/3/25 Ralph Castain <r...@open-mpi.org>

> On Mar 25, 2012, at 10:57 AM, Júlio Hoffimann wrote:
>
> I forgot to mention, I tried to set odls_base_sigkill_timeout as you
> suggested; even 5 s was not sufficient for the root to execute its task,
> and, most importantly, the kill was instantaneous: there was no 5 s delay.
> My erroneous conclusion: SIGKILL was being sent instead of SIGTERM.
>
>
> Which version are you using? Could be a bug in there - I can take a look.
>
>
> About the man page, at least for me, the word "kill" is not clear. The
> SIGTERM+SIGKILL keywords would be unambiguous.
>
>
> I'll clarify it - thanks!
>
>
> Regards,
> Júlio.
>
> 2012/3/25 Ralph Castain <r...@open-mpi.org>
>
>> On Mar 25, 2012, at 7:19 AM, Júlio Hoffimann wrote:
>>
>> Dear Ralph,
>>
>> Thank you for your prompt reply. I confirmed what you said by reading the
>> mpirun man page, sections *Signal Propagation* and *Process Termination /
>> Signal Handling*:
>>
>> "During the run of an MPI application, if any rank dies abnormally
>> (either exiting before invoking MPI_FINALIZE, or dying as the result of a
>> signal), mpirun will print out an error message and kill the rest of the
>> MPI application."
>>
>> If I understood correctly, the SIGKILL signal is sent to every process on
>> a premature death.
>>
>>
>> Each process receives a SIGTERM, and then a SIGKILL if it doesn't exit
>> within a specified time frame. I told you how to adjust that time period
>> in the prior message.
>>
>>
>> From my point of view, this is a bug. If Open MPI allows handling signals
>> such as SIGTERM, the other processes in the communicator should also have
>> the opportunity to die gracefully. Perhaps I'm missing something?
>>
>>
>> Yes, you are - you do get a SIGTERM first, but you are required to exit
>> in a timely fashion. You are not allowed to continue running. This is
>> required in order to ensure proper cleanup of the job, per the MPI
>> standard.
>>
>>
>> Given that behaviour, I think it would be great to explicitly mention
>> SIGKILL in the man page, or even better, fix the implementation to send
>> only SIGTERM, making it possible for the user to clean up all processes
>> before exiting.
>>
>>
>> We already do, as described above.
>>
>>
>> I solved my particular problem by adding another flag,
>> *unexpected_error_on_slave*:
>>
>> volatile sig_atomic_t unexpected_error_occurred = 0;
>> int unexpected_error_on_slave = 0;
>> enum tag { work_tag, die_tag };
>>
>> void my_handler( int sig )
>> {
>>     unexpected_error_occurred = 1;
>> }
>>
>> // somewhere in the code...
>> signal(SIGTERM, my_handler);
>>
>> if (world.rank() == root) {
>>
>>     // do stuff
>>
>>     world.recv(mpi::any_source, die_tag, unexpected_error_on_slave);
>>     if ( unexpected_error_occurred || unexpected_error_on_slave ) {
>>
>>         // save something
>>
>>         world.abort(SIGABRT);
>>     }
>> } else { // slave process
>>
>>     // do different stuff
>>
>>     if ( unexpected_error_occurred ) {
>>         // just communicate the problem to the root
>>         world.send(root, die_tag, 1);
>>         signal(SIGTERM, SIG_DFL);
>>         while (true)
>>             ; // wait, the master will take care of this
>>     }
>>     world.send(root, die_tag, 0); // everything is fine
>> }
>> signal(SIGTERM, SIG_DFL); // reassign default handler
>> // the code continues...
>>
>> Note the slave must hang so the store operation gets executed at the
>> root; otherwise we are back to the previous scenario. It's theoretically
>> unnecessary to send MPI messages to accomplish the desired cleanup, and
>> in more complex applications this can turn into a nightmare. As we know,
>> asynchronous events are insane to debug.
>>
>> Best regards,
>> Júlio.
>>
>> P.S.: Open MPI 1.4.3 from the Ubuntu 11.10 repositories.
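As an aside for later readers: below is a minimal sketch of the
exit-promptly-on-SIGTERM pattern Ralph describes, written against the plain
MPI C API rather than the Boost.MPI wrappers quoted above. The names
(term_handler, got_sigterm) and the save_checkpoint() call are illustrative,
not part of any API.

#include <csignal>
#include <cstdlib>
#include <unistd.h>
#include <mpi.h>

static volatile sig_atomic_t got_sigterm = 0;

// Async-signal-safe handler: it only sets a flag.
static void term_handler(int) { got_sigterm = 1; }

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    std::signal(SIGTERM, term_handler);

    while (!got_sigterm) {
        sleep(1); // stand-in for one unit of work; check the flag between units
    }

    // save_checkpoint();  // hypothetical "save something" step
    //
    // Exit before mpirun escalates to SIGKILL. Peers may already be dead,
    // so avoid collective calls such as MPI_Finalize on this path.
    _exit(EXIT_FAILURE);
}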
>> 2012/3/23 Ralph Castain <r...@open-mpi.org>
>>
>>> Well, yes and no. When a process abnormally terminates, OMPI will kill
>>> the job - this is done by first hitting each process with a SIGTERM,
>>> followed shortly thereafter by a SIGKILL. So you do have a short time on
>>> each process to attempt to clean up.
>>>
>>> My guess is that your signal handler actually is getting called, but we
>>> then kill the process before you can detect that it was called.
>>>
>>> You might try adjusting the time between the SIGTERM and the SIGKILL
>>> using the odls_base_sigkill_timeout MCA param:
>>>
>>> mpirun -mca odls_base_sigkill_timeout N
>>>
>>> should cause it to wait for N seconds before issuing the SIGKILL. Not
>>> sure if that will help or not - it used to work for me, but I haven't
>>> tried it in a while. What version of OMPI are you using?
>>>
>>>
>>> On Mar 22, 2012, at 4:49 PM, Júlio Hoffimann wrote:
>>>
>>> Dear all,
>>>
>>> I'm trying to handle signals inside an MPI task-farming model. The
>>> following is pseudo-code of what I'm trying to achieve:
>>>
>>> volatile sig_atomic_t unexpected_error_occurred = 0;
>>>
>>> void my_handler( int sig )
>>> {
>>>     unexpected_error_occurred = 1;
>>> }
>>>
>>> // somewhere in the code...
>>> signal(SIGTERM, my_handler);
>>>
>>> if (world.rank() == root) {
>>>
>>>     // do stuff
>>>
>>>     if ( unexpected_error_occurred ) {
>>>
>>>         // save something
>>>
>>>         // re-raise SIGTERM, but now with the default handler
>>>         signal(SIGTERM, SIG_DFL);
>>>         raise(SIGTERM);
>>>     }
>>> } else { // slave process
>>>
>>>     // do different stuff
>>>
>>>     if ( unexpected_error_occurred ) {
>>>         // just propagate the signal to the root
>>>         signal(SIGTERM, SIG_DFL);
>>>         raise(SIGTERM);
>>>     }
>>> }
>>> signal(SIGTERM, SIG_DFL); // reassign default handler
>>> // the code continues...
>>>
>>> As can be seen, the signal handling is required to implement a restart
>>> feature. The whole problem resides in my assumption that all processes
>>> in the communicator will receive a SIGTERM as a side effect. Is that a
>>> valid assumption? How does the actual MPI implementation deal with such
>>> scenarios?
>>>
>>> I also tried to replace all the raise() calls with MPI_Abort(), which
>>> according to the documentation
>>> (http://www.open-mpi.org/doc/v1.5/man3/MPI_Abort.3.php) sends a SIGTERM
>>> to all associated processes. The undesired behaviour persists: when a
>>> slave process is killed, the save section in the root branch is not
>>> executed.
>>>
>>> Appreciate any help,
>>> Júlio.
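For completeness, here is the notify-then-wait protocol the thread converges
on, condensed into the plain MPI C API. The tag value, the forced failure on
rank 1, and the save_checkpoint() call are all illustrative:

#include <csignal>
#include <cstdlib>
#include <unistd.h>
#include <mpi.h>

enum { DIE_TAG = 1 };

// Slave side: report the failure over MPI instead of re-raising SIGTERM,
// then park until the root tears the job down.
static void slave_report(MPI_Comm comm, int root, int failed)
{
    MPI_Send(&failed, 1, MPI_INT, root, DIE_TAG, comm);
    if (failed) {
        std::signal(SIGTERM, SIG_DFL); // accept the coming shutdown signal
        for (;;)
            pause();                   // wait: the root will abort the job
    }
}

// Root side: learn about the failure in time to save state first.
static void root_check(MPI_Comm comm)
{
    int slave_failed = 0;
    MPI_Recv(&slave_failed, 1, MPI_INT, MPI_ANY_SOURCE, DIE_TAG, comm,
             MPI_STATUS_IGNORE);
    if (slave_failed) {
        // save_checkpoint();          // hypothetical save step
        MPI_Abort(comm, EXIT_FAILURE); // kills every associated process
    }
}

// Run with exactly two ranks, e.g. "mpirun -np 2 ./notify_demo":
// rank 1 fakes a failure, rank 0 saves and then aborts the job.
int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        root_check(MPI_COMM_WORLD);
    else
        slave_report(MPI_COMM_WORLD, 0, /*failed=*/1);

    MPI_Finalize(); // not reached on the failure path shown here
    return 0;
}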
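Finally, a concrete form of the MCA parameter Ralph suggests above, here with
a 10-second grace period between the SIGTERM and the SIGKILL (the application
name is made up):

mpirun -mca odls_base_sigkill_timeout 10 -np 4 ./task_farm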