Dear Sven,

I thought about doing that and experimented a bit as well, but it has some problems: I need to relink the user's code, registering an atexit function is tricky from the Fortran code, and I still need to know whether MPI_Finalize has been called before my atexit routine runs (and, as it turns out, MPI_Init as well, otherwise there are problems with things like call system).
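
A minimal sketch of the bookkeeping such an atexit routine would need, assuming the init/finalize state is tracked in two flags. All guard_* names are hypothetical and not part of Open MPI. A handler linked against MPI could presumably query MPI_Initialized and MPI_Finalized instead of keeping its own flags, and the sketch only prints a message where a real handler might call MPI_Abort. Note that atexit handlers do not run when the runtime calls _exit.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical bookkeeping: a real build would set these from wrappers
 * around MPI_Init / MPI_Finalize, or query MPI_Initialized and
 * MPI_Finalized instead of keeping flags. */
static int guard_initialized = 0;
static int guard_finalized = 0;

/* Call right after MPI_Init has returned. */
void guard_note_init(void)
{
    guard_initialized = 1;
}

/* Call right after MPI_Finalize has returned. */
void guard_note_finalize(void)
{
    guard_finalized = 1;
}

/* Returns 1 if the process is between MPI_Init and MPI_Finalize,
 * i.e. exiting now would leave the other ranks waiting. A process
 * that never called MPI_Init is not considered abnormal. */
int guard_exit_is_abnormal(void)
{
    return guard_initialized && !guard_finalized;
}

/* Runs on exit() or on return from main, but not on _exit(). */
static void guard_at_exit(void)
{
    if (guard_exit_is_abnormal()) {
        fprintf(stderr, "guard: exit without MPI_Finalize\n");
        /* An MPI build might call MPI_Abort(MPI_COMM_WORLD, 1) here
         * (assumption: that is acceptable from an atexit handler). */
    }
}

/* Registers the handler; returns 0 on success, like atexit(). */
int guard_install(void)
{
    return atexit(guard_at_exit);
}
```

The idea would be to call guard_install() once early on, and guard_note_init() and guard_note_finalize() right after MPI_Init and MPI_Finalize return.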

Best regards
Daniel

On Mon, 20 Aug 2007 14:37:44 +0200, Sven Stork <st...@hlrs.de> wrote:

Instead of doing dirty tricks with the library, you could try to register a
cleanup function with atexit.

Thanks,
  Sven

On Friday 17 August 2007 19:59, Daniel Spångberg wrote:
Dear George,

I think that the best way is to call MPI_Abort. However, this forces the
user to modify the code, which I have already suggested. But their
application is not calling exit directly; I merely wrote the simplest code
that demonstrates the problem. Their application is a Fortran program, and
during file I/O, when something bad happens, the Fortran runtime (PGI)
calls exit (and sometimes _exit, for some reason). The file I/O is only done
in one process. I have told them to try adding ERR=lineno,END=lineno,
where the code at lineno calls MPI_Abort. This has not happened yet.
Nevertheless, Open MPI does not terminate the application when one of the
processes exits without calling MPI_Finalize, contrary to what the mpirun
man page says. I have currently "solved" the problem by writing a .so that is
LD_PRELOADed and checks whether MPI_Finalize is indeed called between
MPI_Init and exit/_exit. I'd rather not keep this "solution" for too long.
If the mpirun man page is indeed wrong and the code right,
I'd rather push the proper error-handling solution.
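
For illustration, a shim of this kind might look roughly like the sketch below. It is not the actual library described above, only one way such a check could be written, under several assumptions: glibc-style dlsym(RTLD_NEXT, ...) is available, the program (including a Fortran one) reaches the C symbols MPI_Init and MPI_Finalize through the dynamic linker, and dying from SIGABRT makes mpirun kill the job in the same way as other fatal signals. The shim_* names are hypothetical. A version built against mpi.h could call MPI_Abort instead of abort().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* for RTLD_NEXT on glibc */
#endif
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* 0 = MPI_Init not seen, 1 = between MPI_Init and MPI_Finalize,
 * 2 = MPI_Finalize seen. Starts at 0 in every new program image, so
 * helper programs that never call MPI_Init pass through untouched. */
static int shim_state = 0;

/* Find the next definition of a symbol after this library. */
static void *shim_next(const char *name)
{
    return dlsym(RTLD_NEXT, name);
}

int MPI_Init(int *argc, char ***argv)
{
    int (*real_init)(int *, char ***) =
        (int (*)(int *, char ***))shim_next("MPI_Init");
    int rc;

    if (real_init == NULL)
        return -1;               /* no MPI library behind the shim */
    rc = real_init(argc, argv);
    if (rc == 0)                 /* 0 is MPI_SUCCESS */
        shim_state = 1;
    return rc;
}

int MPI_Finalize(void)
{
    int (*real_finalize)(void) = (int (*)(void))shim_next("MPI_Finalize");
    int rc;

    if (real_finalize == NULL)
        return -1;
    rc = real_finalize();
    shim_state = 2;
    return rc;
}

/* Die from a signal if the process exits while MPI is still active. */
static void shim_check(const char *who)
{
    if (shim_state == 1) {
        fprintf(stderr, "shim: %s called without MPI_Finalize\n", who);
        abort();                 /* raises SIGABRT */
    }
}

void exit(int status)
{
    void (*real_exit)(int) = (void (*)(int))shim_next("exit");

    shim_check("exit");
    if (real_exit != NULL)
        real_exit(status);
    abort();                     /* only reached if the lookup failed */
}

void _exit(int status)
{
    void (*real_exit)(int) = (void (*)(int))shim_next("_exit");

    shim_check("_exit");
    if (real_exit != NULL)
        real_exit(status);
    abort();                     /* only reached if the lookup failed */
}
```

Something like cc -shared -fPIC -o libshim.so shim.c -ldl would build it (the file names are placeholders), and the resulting library would then be named in LD_PRELOAD for the MPI processes.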

Best regards
Daniel Spångberg


On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosi...@eecs.utk.edu>
wrote:

> The MPI standard states that the correct way to abort/kill an MPI
> application is to use the MPI_Abort function. Unless you're doing
> some kind of fault-tolerance work, there is no reason to end one of
> your MPI processes via exit.
>
>    Thanks,
>      george.
>
> On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:
>
>> Dear Open-MPI user list members,
>>
>> I currently have a user with an application where one of the
>> MPI processes dies, but the Open MPI system does not kill the rest of
>> the application.
>>
>> Since the mpirun man page states the following, I would expect it to
>> take care of killing the application if a process exits without calling
>> MPI_Finalize:
>>
>>     Process Termination / Signal Handling
>>         During the run of an MPI application, if any rank dies
>>         abnormally (either exiting before invoking MPI_FINALIZE, or
>>         dying as the result of a signal), mpirun will print out an
>>         error message and kill the rest of the MPI application.
>>
>> The following test program demonstrates the behaviour (the program
>> hangs until it is killed by the user or batch system):
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <mpi.h>
>>
>> #define RANK_DEATH 1
>>
>> int main(int argc, char **argv)
>> {
>>    int rank;
>>    MPI_Init(&argc,&argv);
>>    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
>>
>>    sleep(10);
>>    if (rank==RANK_DEATH)
>>      exit(1);
>>    sleep(10);
>>    MPI_Finalize();
>>    return 0;
>> }
>>
>> I have tested this on Open MPI 1.2.1 as well as the latest stable
>> release, 1.2.3. I am on Linux x86_64.
>>
>> Is this a bug, or are there some flags I can use to force mpirun (or
>> orted, or...) to kill the whole MPI program when this happens?
>>
>> If one of the application processes dies from a signal (I have
>> tested SEGV and FPE) rather than just exiting, the whole application
>> is indeed killed.
>>
>> Best regards
>> Daniel Spångberg
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users