Re: [OMPI users] Process termination problem

Daniel Spångberg Fri, 17 Aug 2007 13:59:21 -0400

Dear George,

I think that the best way is to call MPI_Abort. However, this forces the
user to modify the code, which I have already suggested. But their
application is not calling exit directly; I merely wrote the simplest code
that demonstrates the problem. Their application is a Fortran program, and
during file IO, when something bad happens, the Fortran runtime (pgi)
calls exit (and sometimes _exit for some reason). The file IO is only done
in one process. I have told them to try to add ERR=lineno,END=lineno,
where the code at lineno calls MPI_Abort. This has not happened yet.

Nevertheless, openmpi does not terminate the application when one of the
processes exits without MPI_Finalize, contrary to the content of the mpirun
man page. I have currently "solved" the problem by writing a .so that is
LD_PRELOAD:ed, checking whether MPI_Finalize is indeed called between
MPI_Init and exit/_exit. I'd rather not keep this "solution" for too long.
If it is indeed so that the mpirun man page is wrong and the code right,
I'd rather push the proper error-handling solution.
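
For readers who want to try the same workaround, here is a minimal sketch of what such a preload shim could look like. It is not the actual .so described above. It assumes the MPI library exports the standard profiling entry points PMPI_Init and PMPI_Finalize, it only catches exit() (an atexit handler never runs on _exit), and it assumes that raising SIGABRT is an acceptable way to make the death visible to mpirun. The names check_at_exit and exited_without_finalize are made up for the illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Profiling entry points from the MPI standard. Declared weak so that this
   file also links when no MPI library is present (they are then NULL). */
extern int PMPI_Init(int *argc, char ***argv) __attribute__((weak));
extern int PMPI_Finalize(void) __attribute__((weak));

static int mpi_initialized = 0;
static int mpi_finalized = 0;

/* Hypothetical helper: true if the process went through MPI_Init but is
   leaving without having gone through MPI_Finalize. */
static int exited_without_finalize(int initialized, int finalized)
{
    return initialized && !finalized;
}

/* Runs during exit(). Not reached on _exit(), which skips atexit handlers. */
static void check_at_exit(void)
{
    if (exited_without_finalize(mpi_initialized, mpi_finalized)) {
        fprintf(stderr, "exit without MPI_Finalize, raising SIGABRT\n");
        abort(); /* assumption: die from a signal so the launcher notices */
    }
}

/* Interposed MPI_Init: forward to the real one, then arm the exit check. */
int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init ? PMPI_Init(argc, argv) : 0;
    if (!mpi_initialized) {
        mpi_initialized = 1;
        atexit(check_at_exit);
    }
    return rc;
}

/* Interposed MPI_Finalize: record the call, then forward to the real one. */
int MPI_Finalize(void)
{
    mpi_finalized = 1;
    return PMPI_Finalize ? PMPI_Finalize() : 0;
}
```

The file would be built as a shared object (for example with gcc -shared -fPIC) and listed in LD_PRELOAD in the environment of every rank. Whether a Fortran program's mpi_init and mpi_finalize calls pass through these C symbols depends on the MPI implementation, so that needs to be checked before relying on it.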


Best regards
Daniel Spångberg

On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosi...@eecs.utk.edu> wrote:

The MPI standard states that the correct way to abort/kill an MPI
application is to use the MPI_Abort function. Unless you're doing
some kind of fault tolerance stuff, there is no reason to end one of
your MPI processes via exit.

   Thanks,
     george.

On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:

Dear Open-MPI user list members,

I currently have a user with an application where one of the MPI
processes dies, but the openmpi system does not kill the rest of the
application.

Since the mpirun man page states the following, I would expect it to
take care of killing the application if a process exits without calling
MPI_Finalize:

    Process Termination / Signal Handling
        During the run of an MPI application, if any rank dies
        abnormally (either exiting before invoking MPI_FINALIZE, or
        dying as the result of a signal), mpirun will print out an
        error message and kill the rest of the MPI application.

The following test program demonstrates the behaviour (the program
hangs until it is killed by the user or batch system):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

/* Rank that exits early without calling MPI_Finalize */
#define RANK_DEATH 1

int main(int argc, char **argv)
{
   int rank;
   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD,&rank);

   sleep(10);
   /* One rank leaves via exit() instead of MPI_Abort or MPI_Finalize */
   if (rank==RANK_DEATH)
     exit(1);
   sleep(10);
   MPI_Finalize();
   return 0;
}

I have tested this on openmpi 1.2.1 as well as the latest stable,
1.2.3. I am on Linux x86_64.

Is this a bug, or are there some flags I can use to force the mpirun
(or orted, or...) to kill the whole MPI program when this happens?

If one of the application processes dies from a signal (I have tested
SEGV and FPE) rather than just exiting, the whole application is indeed
killed.
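
The difference between the two cases is visible to any parent process through the status returned by waitpid. The sketch below is plain POSIX and is not Open MPI's code; run_child and classify are hypothetical helper names, and SIGTERM stands in for the SEGV and FPE cases mentioned above. It shows that a death by signal is reported separately from an exit, while an exit(1) before MPI_Finalize looks to the parent like an ordinary exit with a nonzero code.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that either exits with `value` or
   kills itself with signal `value`, and return the raw wait status. */
static int run_child(int use_signal, int value)
{
    int status = 0;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (use_signal) {
            signal(value, SIG_DFL); /* make sure the default action applies */
            raise(value);
        }
        _exit(value); /* _exit avoids flushing stdio buffers twice */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}

/* Hypothetical helper: 0 = exit code zero, 1 = nonzero exit code,
   2 = killed by a signal, -1 = anything else. */
static int classify(int status)
{
    if (WIFSIGNALED(status))
        return 2;
    if (WIFEXITED(status))
        return WEXITSTATUS(status) != 0 ? 1 : 0;
    return -1;
}
```

Nothing in the wait status says whether MPI_Finalize was called, so a launcher that wants to treat an early exit as abnormal has to track that by some other means.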

Best regards
Daniel Spångberg
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


