Well, yes and no. When a process terminates abnormally, OMPI will kill the job 
- this is done by first hitting each process with a SIGTERM, followed shortly 
thereafter by a SIGKILL. So you do have a short time on each process to attempt 
to clean up.

My guess is that your signal handler actually is getting called, but we then 
kill the process before you can detect that it was called.

You might try adjusting the time between SIGTERM and SIGKILL using the 
odls_base_sigkill_timeout MCA param:

mpirun -mca odls_base_sigkill_timeout N

should cause it to wait for N seconds before issuing the SIGKILL. Not sure if 
that will help or not - it used to work for me, but I haven't tried it for 
a while. Which version of OMPI are you using?
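
To make use of the window between SIGTERM and SIGKILL, the handler should 
only set a flag, and the work loop has to poll that flag often enough to 
save before the timeout expires. Below is a minimal sketch of that pattern, 
assuming a POSIX system with sigaction(). The names install_sigterm_handler, 
run_work_loop, save_checkpoint and last_saved_step are hypothetical and only 
for illustration; none of this is OMPI API.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Set by the handler, polled by the work loop. */
static volatile sig_atomic_t term_requested = 0;

/* Hypothetical bookkeeping: last step that was checkpointed, -1 if none. */
static int last_saved_step = -1;

/* The handler only sets a flag; real cleanup happens outside of it. */
static void on_sigterm(int sig)
{
    (void)sig;
    term_requested = 1;
}

/* Install the SIGTERM handler. Returns 0 on success, -1 on failure. */
int install_sigterm_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Hypothetical stand-in for "save something" in the original pseudo-code. */
void save_checkpoint(int step)
{
    last_saved_step = step;
    fprintf(stderr, "SIGTERM seen, saved state at step %d\n", step);
}

/*
 * Run up to max_steps short units of work, checking the flag before each
 * one. Returns the number of units completed before stopping.
 */
int run_work_loop(int max_steps)
{
    int step;
    for (step = 0; step < max_steps; ++step) {
        if (term_requested) {
            save_checkpoint(step);
            break;
        }
        /* one short unit of work goes here */
    }
    return step;
}
```

Call install_sigterm_handler() once during startup, then run_work_loop() from 
the root process. The save only has a chance if each unit of work is short 
compared to the SIGKILL timeout, so a long blocking call between two checks 
of the flag can still use up the whole window.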


On Mar 22, 2012, at 4:49 PM, Júlio Hoffimann wrote:

> Dear all,
> 
> I'm trying to handle signals inside an MPI task farming model. The following 
> is pseudo-code for what I'm trying to achieve:
> 
> volatile sig_atomic_t unexpected_error_occurred = 0;
> 
> void my_handler( int sig )
> {
>     unexpected_error_occurred = 1;
> }
> 
> //
> // somewhere in the code...
> //
> 
> signal(SIGTERM, my_handler);
> 
> if (root process) {
> 
>     // do stuff
> 
>     if ( unexpected_error_occurred ) {
> 
>         // save something
> 
>         // re-raise SIGTERM, but now with the default handler
>         signal(SIGTERM, SIG_DFL);
>         raise(SIGTERM);
>     }
> }
> else { // slave process
> 
>     // do different stuff
> 
>     if ( unexpected_error_occurred ) {
> 
>         // just propagate the signal to the root
>         signal(SIGTERM, SIG_DFL);
>         raise(SIGTERM);
>     }
> }
> 
> signal(SIGTERM, SIG_DFL);                       // reassign default handler
> 
> // continues the code...
> 
> As can be seen, the signal handling is required for implementing a restart 
> feature. The whole problem rests on my assumption that all processes in the 
> communicator will receive a SIGTERM as a side effect. Is that a valid 
> assumption? How does the actual MPI implementation deal with such scenarios?
> 
> I also tried replacing all the raise() calls with MPI_Abort(), which according 
> to the documentation (http://www.open-mpi.org/doc/v1.5/man3/MPI_Abort.3.php) 
> sends a SIGTERM to all associated processes. The undesired behaviour 
> persists: when killing a slave process, the save section in the root branch 
> is not executed.
> 
> Appreciate any help,
> Júlio.
