Dear George,
I think that the best way is to call MPI_Abort. However, this forces the
user to modify the code, which I have already suggested. Their
application does not call exit directly; I merely wrote the simplest code
that demonstrates the problem. Their application is a Fortran program, and
during file I/O, when something bad happens, the Fortran runtime (PGI)
calls exit (and sometimes _exit, for some reason). The file I/O is only done
in one process. I have told them to try adding ERR=lineno,END=lineno to the
I/O statements, where the code at lineno calls MPI_Abort. This has not
happened yet.
Nevertheless, Open MPI does not terminate the application when one of the
processes exits without calling MPI_Finalize, contrary to what the mpirun
man page says. I have currently "solved" the problem by writing a .so that
is LD_PRELOADed and checks whether MPI_Finalize is indeed called between
MPI_Init and exit/_exit. I'd rather not keep this "solution" for too long.
If it is indeed the case that the mpirun man page is wrong and the code is
right, I'd rather push the proper error-handling solution.
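For readers who want to see the shape of such a check, here is a minimal
sketch of the bookkeeping a preloaded library of this kind might do. It is
not the actual .so described above. All guard_* names are hypothetical, and
the wiring is an assumption: wrappers around MPI_Init and MPI_Finalize would
call the two "note" functions, and interposed exit/_exit functions would call
guard_check_exit before forwarding to the real ones. The kill_job callback
stands in for whatever ends the job, for example a small function that calls
MPI_Abort.

```c
#include <stdio.h>

/* MPI lifecycle as seen by the hypothetical guard library. */
enum guard_state { GUARD_BEFORE_INIT, GUARD_RUNNING, GUARD_FINALIZED };

static enum guard_state state = GUARD_BEFORE_INIT;

/* To be called by the MPI_Init wrapper once initialisation succeeded. */
void guard_note_init(void) { state = GUARD_RUNNING; }

/* To be called by the MPI_Finalize wrapper. */
void guard_note_finalize(void) { state = GUARD_FINALIZED; }

/*
 * To be called by the interposed exit()/_exit() before forwarding to the
 * real function. If MPI is still active, report it and invoke kill_job
 * (assumed to be something like a thin wrapper around MPI_Abort).
 * Returns 1 when the exit was judged abnormal, 0 otherwise.
 */
int guard_check_exit(int status, void (*kill_job)(int status))
{
    if (state != GUARD_RUNNING)
        return 0;
    fprintf(stderr, "guard: exit(%d) called before MPI_Finalize\n", status);
    if (kill_job != NULL)
        kill_job(status);
    return 1;
}
```

Keeping the decision separate from the interposition makes it possible to
exercise the logic without an MPI installation. The interposition itself
(looking up the real exit and forwarding to it) is left out here.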
Best regards
Daniel Spångberg
On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosi...@eecs.utk.edu>
wrote:
The MPI standard states that the correct way to abort/kill an MPI
application is to use the MPI_Abort function. Unless you're doing
some kind of fault tolerance work, there is no reason to end one of
your MPI processes via exit.
Thanks,
george.
On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:
Dear Open-MPI user list members,
I currently have a user with an application where one of the
MPI processes dies, but Open MPI does not kill the rest of the
application.
Since the mpirun man page states the following, I would expect it to take
care of killing the application if a process exits without calling
MPI_Finalize:
Process Termination / Signal Handling
During the run of an MPI application, if any rank dies abnormally
(either exiting before invoking MPI_FINALIZE, or dying as the result of a
signal), mpirun will print out an error message and kill the rest of the
MPI application.
The following test program demonstrates the behaviour (the program hangs
until it is killed by the user or the batch system):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

#define RANK_DEATH 1   /* rank that exits without calling MPI_Finalize */

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sleep(10);
    if (rank == RANK_DEATH)
        exit(1);       /* plain exit: no MPI_Finalize, no MPI_Abort */

    sleep(10);
    MPI_Finalize();
    return 0;
}
I have tested this on Open MPI 1.2.1 as well as the latest stable 1.2.3.
I am on Linux x86_64.
Is this a bug, or are there some flags I can use to force mpirun (or
orted, or...) to kill the whole MPI program when this happens?
If one of the application processes dies from a signal (I have tested SEGV
and FPE) rather than just exiting, the whole application is indeed killed.
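Since death by a signal does bring the job down, one conceivable stopgap is
to turn an early exit() into a signal death from inside the application. The
sketch below registers an atexit handler that calls abort() (SIGABRT) while
MPI is still active. The early_exit_guard_* names are hypothetical, and two
assumptions are worth stating: only SEGV and FPE were tested above, so
whether mpirun reacts the same way to SIGABRT is not confirmed here, and
atexit handlers do not run when the process leaves via _exit.

```c
#include <stdio.h>
#include <stdlib.h>

static int mpi_active = 0;      /* 1 between MPI_Init and MPI_Finalize */
static int hook_installed = 0;  /* the atexit handler is registered once */

/* atexit handler: turn an early exit() into death by SIGABRT. */
static void abort_if_mpi_active(void)
{
    if (mpi_active) {
        fprintf(stderr, "exit before MPI_Finalize, raising SIGABRT\n");
        abort();
    }
}

/* Call directly after MPI_Init. Returns 0 on success, -1 on failure. */
int early_exit_guard_arm(void)
{
    if (!hook_installed) {
        if (atexit(abort_if_mpi_active) != 0)
            return -1;
        hook_installed = 1;
    }
    mpi_active = 1;
    return 0;
}

/* Call directly after MPI_Finalize so that a normal exit stays normal. */
void early_exit_guard_disarm(void) { mpi_active = 0; }

/* Query, useful for logging or testing. */
int early_exit_guard_armed(void) { return mpi_active; }
```

In the test program above, early_exit_guard_arm() would go right after
MPI_Init and early_exit_guard_disarm() right after MPI_Finalize. This still
requires touching the source, so it does not help where the code cannot be
changed.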
Best regards
Daniel Spångberg
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users