Dear George,
I think that the best way is to call MPI_Abort. However, this forces the
user to modify the code, which I have already suggested. Their
application does not call exit directly; I merely wrote the simplest
code that demonstrates the problem. Their application is a Fortran
program, and during file I/O, when something bad happens, the Fortran
runtime (PGI) calls exit (and sometimes _exit, for some reason). The
file I/O is done in only one process. I have told them to try adding
ERR=lineno,END=lineno to the I/O statements, where the code at lineno
calls MPI_Abort. This has not happened yet.
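For illustration, the same idea in C (the language of the test program
quoted below) could be sketched roughly as follows. This is not the
user's code: read_first_line, fatal_handler, record_fatal and the codes
2 and 3 are made-up names and values. In the real program the handler
would call MPI_Abort(MPI_COMM_WORLD, code); a recording stand-in is used
here only so that the sketch compiles without an MPI installation.

```c
/* Sketch: route every I/O failure to one handler instead of letting the
 * runtime call exit(). All names and codes here are hypothetical. */
#include <assert.h>
#include <stdio.h>

/* In the MPI program this would wrap MPI_Abort(MPI_COMM_WORLD, code). */
typedef void (*fatal_handler)(int code);

/* Stand-in handler that only records the code, so the sketch runs
 * without MPI. A real handler would not return. */
int last_fatal_code = 0;
void record_fatal(int code) { last_fatal_code = code; }

/* Reads the first line of `path` into `buf` (at most len-1 characters).
 * An open failure plays the role of the ERR= branch (code 2), an empty
 * file or read error that of the END= branch (code 3). Returns 0 on
 * success, otherwise the code that was passed to the handler. */
int read_first_line(const char *path, char *buf, int len,
                    fatal_handler on_fatal)
{
    assert(path != NULL && buf != NULL && len > 0 && on_fatal != NULL);

    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        on_fatal(2);
        return 2;
    }
    if (fgets(buf, len, fp) == NULL) {
        fclose(fp);
        on_fatal(3);
        return 3;
    }
    fclose(fp);
    return 0;
}
```

The point of the single handler is that every failure path ends in the
same place, so none of them can fall through to a plain exit.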
Nevertheless, Open MPI does not terminate the application when one of
the processes exits without calling MPI_Finalize, contrary to the
mpirun man page. I have currently "solved" the problem by writing a .so
that is LD_PRELOADed and checks whether MPI_Finalize is indeed called
between MPI_Init and exit/_exit. I'd rather not keep this "solution"
for too long.
If it is indeed so that the mpirun man-page is wrong and the code right,
I'd rather push the proper error-handling solution.
Best regards
Daniel Spångberg
On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosi...@eecs.utk.edu> wrote:
> The MPI standard states that the correct way to abort/kill an MPI
> application is using the MPI_Abort function. Unless you're doing
> some kind of fault tolerance stuff, there is no reason to end one of
> your MPI processes via exit.
>
> Thanks,
> george.
>
> On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:
>
>> Dear Open-MPI user list members,
>>
>> I currently have a user with an application in which one of the MPI
>> processes dies, but Open MPI does not kill the rest of the
>> application.
>>
>> Since the mpirun man page states the following, I would expect it to
>> take care of killing the application if a process exits without
>> calling MPI_Finalize:
>>
>> Process Termination / Signal Handling
>>     During the run of an MPI application, if any rank dies abnormally
>>     (either exiting before invoking MPI_FINALIZE, or dying as the
>>     result of a signal), mpirun will print out an error message and
>>     kill the rest of the MPI application.
>>
>> The following test program demonstrates the behaviour (the program
>> hangs until it is killed by the user or the batch system):
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <mpi.h>
>>
>> #define RANK_DEATH 1
>>
>> int main(int argc, char **argv)
>> {
>>     int rank;
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>     sleep(10);
>>     /* One rank exits without calling MPI_Finalize. */
>>     if (rank == RANK_DEATH)
>>         exit(1);
>>     sleep(10);
>>     MPI_Finalize();
>>     return 0;
>> }
>>
>> I have tested this on Open MPI 1.2.1 as well as the latest stable
>> release, 1.2.3. I am on Linux x86_64.
>>
>> Is this a bug, or are there some flags I can use to force mpirun (or
>> orted, or...) to kill the whole MPI program when this happens?
>>
>> If one of the application processes dies from a signal (I have
>> tested SEGV and FPE) rather than just exiting, the whole application
>> is indeed killed.
>>
>> Best regards
>> Daniel Spångberg
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users