I doubt anything will be done about those warnings, given that the MPI Forum 
has voted to remove the C++ bindings altogether.


On Mar 25, 2012, at 12:36 PM, Júlio Hoffimann wrote:

> I don't have much time right now to try a more recent version, but I'll keep
> that in mind. I also dislike the warnings my current version is giving me
> (http://www.open-mpi.org/community/lists/devel/2011/08/9606.php). I'll see
> how to contact the Ubuntu maintainers about updating Open MPI and solve both
> problems in one shot. ;-)
> 
> Regards,
> Júlio.
> 
> 2012/3/25 Ralph Castain <r...@open-mpi.org>
> 
> On Mar 25, 2012, at 11:28 AM, Júlio Hoffimann wrote:
> 
>> I mentioned the version in a previous P.S.: Open MPI 1.4.3 from the Ubuntu
>> 11.10 repositories. :-)
> 
> Sorry - I see a lot of emails over the day, and forgot. :-/
> 
> Have you tried this on something more recent, like 1.5.4 or even the 
> developer's trunk? IIRC, there were some issues in the older 1.4 releases, 
> but they have since been fixed.
> 
>> 
>> Thanks for the clarifications!
>> 
>> 2012/3/25 Ralph Castain <r...@open-mpi.org>
>> 
>> On Mar 25, 2012, at 10:57 AM, Júlio Hoffimann wrote:
>> 
>>> I forgot to mention: I tried setting odls_base_sigkill_timeout as you
>>> suggested. Even 5 s was not sufficient for the root to execute its task, and,
>>> more importantly, the kill was instantaneous; there was no 5 s delay. Hence my
>>> erroneous conclusion that SIGKILL was being sent instead of SIGTERM.
>> 
>> Which version are you using? Could be a bug in there - I can take a look.
>> 
>>> 
>>> About the man page: at least for me, the word "kill" is unclear. Naming the
>>> signals explicitly (SIGTERM followed by SIGKILL) would be unambiguous.
>> 
>> I'll clarify it - thanks!
>> 
>>> 
>>> Regards,
>>> Júlio.
>>> 
>>> 2012/3/25 Ralph Castain <r...@open-mpi.org>
>>> 
>>> On Mar 25, 2012, at 7:19 AM, Júlio Hoffimann wrote:
>>> 
>>>> Dear Ralph,
>>>> 
>>>> Thank you for your prompt reply. I confirmed what you said by reading the
>>>> mpirun man page, in the sections Signal Propagation and Process
>>>> Termination / Signal Handling:
>>>> 
>>>> "During the run of an MPI  application,  if  any  rank  dies  abnormally 
>>>> (either exiting before invoking MPI_FINALIZE, or dying as the result of a 
>>>> signal), mpirun will print out an error message and kill the rest  of the 
>>>> MPI application."
>>>> 
>>>> If I understood correctly, a SIGKILL signal is sent to every process upon a
>>>> premature death.
>>> 
>>> Each process receives a SIGTERM, and then a SIGKILL if it doesn't exit 
>>> within a specified time frame. I told you how to adjust that time period in 
>>> the prior message.
>>> 
>>>> From my point of view, this is a bug. If Open MPI allows handling signals
>>>> such as SIGTERM, the other processes in the communicator should also have
>>>> the opportunity to die gracefully. Perhaps I'm missing something?
>>> 
>>> Yes, you are - you do get a SIGTERM first, but you are required to exit in 
>>> a timely fashion. You are not allowed to continue running. This is required 
>>> in order to ensure proper cleanup of the job, per the MPI standard.
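>>> 
>>> A minimal sketch of a handler that cooperates with this scheme (the
>>> save_checkpoint() call is only a placeholder, and strictly speaking only
>>> async-signal-safe work should be done inside a handler):
>>> 
>>> void my_handler( int sig )
>>> {
>>>     save_checkpoint();  // hypothetical quick save; keep it signal-safe
>>>     _exit(1);           // exit immediately, before the SIGKILL arrives
>>> }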
>>> 
>>>> 
>>>> Assuming the behaviour described in the last paragraph, I think it would be
>>>> great to mention SIGKILL explicitly in the man page, or, even better, to fix
>>>> the implementation to send SIGTERM instead, making it possible for the user
>>>> to clean up all processes before exiting.
>>> 
>>> We already do, as described above.
>>> 
>>>> 
>>>> I solved my particular problem by adding another flag,
>>>> unexpected_error_on_slave:
>>>> 
>>>> volatile sig_atomic_t unexpected_error_occurred = 0;
>>>> int unexpected_error_on_slave = 0;
>>>> enum tag { work_tag, die_tag };
>>>> 
>>>> void my_handler( int sig )
>>>> {
>>>>     unexpected_error_occurred = 1;
>>>> }
>>>> 
>>>> //
>>>> // somewhere in the code...
>>>> //
>>>> 
>>>> signal(SIGTERM, my_handler);
>>>> 
>>>> if (root process) {
>>>> 
>>>>     // do stuff
>>>> 
>>>>     world.recv(mpi::any_source, die_tag, unexpected_error_on_slave);
>>>>     if ( unexpected_error_occurred || unexpected_error_on_slave ) {
>>>> 
>>>>         // save something
>>>> 
>>>>         world.abort(SIGABRT);
>>>>     }
>>>> }
>>>> else { // slave process
>>>> 
>>>>     // do different stuff
>>>> 
>>>>     if ( unexpected_error_occurred ) {
>>>> 
>>>>         // just communicate the problem to the root
>>>>         world.send(root, die_tag, 1);
>>>>         signal(SIGTERM, SIG_DFL);
>>>>         while (true)
>>>>             ; // spin; the root will save state and abort the job
>>>>     }
>>>>     world.send(root, die_tag, 0); // everything is fine
>>>> }
>>>> 
>>>> signal(SIGTERM, SIG_DFL);  // reassign the default handler
>>>> 
>>>> // continues the code...
>>>> 
>>>> Note that the slave must hang for the save operation to be executed at the
>>>> root; otherwise we are back to the previous scenario (see the alternative
>>>> sketch below). It is theoretically unnecessary to send MPI messages to
>>>> accomplish the desired cleanup, and in more complex applications this can
>>>> turn into a nightmare. As we know, asynchronous events are insane to debug.
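>>>> 
>>>> A possible alternative, as a rough and untested sketch (this assumes
>>>> Boost.MPI, using its all_reduce collective and maximum<> operation), would
>>>> be to combine the error flags collectively so that no process has to spin
>>>> while the root saves:
>>>> 
>>>> #include <boost/mpi.hpp>
>>>> namespace mpi = boost::mpi;
>>>> 
>>>> // every rank contributes its local flag; every rank learns the maximum
>>>> int local_error = unexpected_error_occurred ? 1 : 0;
>>>> int any_error = 0;
>>>> mpi::all_reduce(world, local_error, any_error, mpi::maximum<int>());
>>>> 
>>>> if ( any_error ) {
>>>>     if (root process) {
>>>>         // save something
>>>>     }
>>>>     world.barrier();    // slaves wait here until the root has saved
>>>>     world.abort(SIGABRT);
>>>> }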
>>>> 
>>>> Best regards,
>>>> Júlio.
>>>> 
>>>> P.S.: Open MPI 1.4.3 from the Ubuntu 11.10 repositories.
>>>> 
>>>> 2012/3/23 Ralph Castain <r...@open-mpi.org>
>>>> Well, yes and no. When a process abnormally terminates, OMPI will kill the 
>>>> job - this is done by first hitting each process with a SIGTERM, followed 
>>>> shortly thereafter by a SIGKILL. So you do have a short time on each 
>>>> process to attempt to clean up.
>>>> 
>>>> My guess is that your signal handler actually is getting called, but we 
>>>> then kill the process before you can detect that it was called.
>>>> 
>>>> You might try adjusting the time between the SIGTERM and the SIGKILL using
>>>> the odls_base_sigkill_timeout MCA param:
>>>> 
>>>> mpirun -mca odls_base_sigkill_timeout N
>>>> 
>>>> should cause it to wait for N seconds before issuing the SIGKILL. Not sure
>>>> if that will help or not - it used to work for me, but I haven't tried it
>>>> for a while. What versions of OMPI are you using?
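>>>> 
>>>> For example (the application name and process count here are just
>>>> placeholders):
>>>> 
>>>>     mpirun -mca odls_base_sigkill_timeout 10 -np 4 ./my_app
>>>> 
>>>> should give each process roughly 10 seconds between the SIGTERM and the
>>>> SIGKILL.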
>>>> 
>>>> 
>>>> On Mar 22, 2012, at 4:49 PM, Júlio Hoffimann wrote:
>>>> 
>>>>> Dear all,
>>>>> 
>>>>> I'm trying to handle signals inside an MPI task-farming model. The following
>>>>> is pseudo-code for what I'm trying to achieve:
>>>>> 
>>>>> volatile sig_atomic_t unexpected_error_occurred = 0;
>>>>> 
>>>>> void my_handler( int sig )
>>>>> {
>>>>>     unexpected_error_occurred = 1;
>>>>> }
>>>>> 
>>>>> //
>>>>> // somewhere in the code...
>>>>> //
>>>>> 
>>>>> signal(SIGTERM, my_handler);
>>>>> 
>>>>> if (root process) {
>>>>> 
>>>>>     // do stuff
>>>>> 
>>>>>     if ( unexpected_error_occurred ) {
>>>>> 
>>>>>         // save something
>>>>> 
>>>>>         // reraise the SIGTERM again, but now with the default handler
>>>>>         signal(SIGTERM, SIG_DFL);
>>>>>         raise(SIGTERM);
>>>>>     }
>>>>> }
>>>>> else { // slave process
>>>>> 
>>>>>     // do different stuff
>>>>> 
>>>>>     if ( unexpected_error_occurred ) {
>>>>> 
>>>>>         // just propagate the signal to the root
>>>>>         signal(SIGTERM, SIG_DFL);
>>>>>         raise(SIGTERM);
>>>>>     }
>>>>> }
>>>>> 
>>>>> signal(SIGTERM, SIG_DFL);  // reassign the default handler
>>>>> 
>>>>> // continues the code...
>>>>> 
>>>>> As can be seen, the signal handling is required to implement a restart
>>>>> feature. The whole problem resides in my assumption that all processes in
>>>>> the communicator will receive a SIGTERM as a side effect. Is that a valid
>>>>> assumption? How does the actual MPI implementation deal with such scenarios?
>>>>> 
>>>>> I also tried replacing all the raise() calls with MPI_Abort(), which,
>>>>> according to the documentation
>>>>> (http://www.open-mpi.org/doc/v1.5/man3/MPI_Abort.3.php), sends a SIGTERM
>>>>> to all associated processes. The undesired behaviour persists: when a
>>>>> slave process is killed, the save section in the root branch is not
>>>>> executed.
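>>>>> 
>>>>> For reference, the C signature is int MPI_Abort(MPI_Comm comm, int
>>>>> errorcode), so the substitution looks like this (the error code is
>>>>> arbitrary):
>>>>> 
>>>>>     MPI_Abort(MPI_COMM_WORLD, 1);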
>>>>> 
>>>>> Appreciate any help,
>>>>> Júlio.
