Re: [OMPI users] Process termination problem

Sven Stork Mon, 20 Aug 2007 08:37:38 -0400

Instead of doing dirty tricks with a preloaded library, you could try
registering a cleanup function with atexit.


Thanks,
  Sven 

On Friday 17 August 2007 19:59, Daniel Spångberg wrote:
> Dear George,
> 
> I think that the best way is to call MPI_Abort. However, this forces the
> user to modify the code, which I have already suggested. Their
> application does not call exit directly, though; I merely wrote the
> simplest code that demonstrates the problem. Their application is a Fortran
> program, and when something goes wrong during file I/O, the Fortran runtime
> (PGI) calls exit (and sometimes _exit, for some reason). The file I/O is
> only done in one process. I have told them to try adding
> ERR=lineno,END=lineno, where the code at lineno calls MPI_Abort. This has
> not happened yet. Nevertheless, Open MPI does not terminate the application
> when one of the processes exits without calling MPI_Finalize, contrary to
> what the mpirun man page says. I have currently "solved" the problem by
> writing a .so that is LD_PRELOAD:ed and checks whether MPI_Finalize is
> indeed called between MPI_Init and exit/_exit. I'd rather not keep this
> "solution" for too long. If it turns out that the mpirun man page is wrong
> and the code is right, I'd rather push the proper error-handling solution.
> 
> Best regards
> Daniel Spångberg
> 
> 
> On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosi...@eecs.utk.edu>  
> wrote:
> 
> > The MPI standard states that the correct way to abort/kill an MPI
> > application is to use the MPI_Abort function. Unless you're doing
> > some kind of fault-tolerance work, there is no reason to end one of
> > your MPI processes via exit.
> >
> >    Thanks,
> >      george.
> >
> > On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:
> >
> >> Dear Open-MPI user list members,
> >>
> >> I currently have a user with an application where one of the
> >> MPI processes dies, but the Open MPI system does not kill the rest
> >> of the application.
> >>
> >> Since the mpirun man page states the following, I would expect it to
> >> take care of killing the application if a process exits without
> >> calling MPI_Finalize:
> >>
> >>     Process Termination / Signal Handling
> >>         During the run of an MPI application, if any rank dies
> >>         abnormally (either exiting before invoking MPI_FINALIZE, or
> >>         dying as the result of a signal), mpirun will print out an
> >>         error message and kill the rest of the MPI application.
> >>
> >> The following test program demonstrates the behaviour (the program
> >> hangs until it is killed by the user or the batch system):
> >>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >> #include <unistd.h>
> >> #include <mpi.h>
> >>
> >> #define RANK_DEATH 1
> >>
> >> int main(int argc, char **argv)
> >> {
> >>    int rank;
> >>    MPI_Init(&argc,&argv);
> >>    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
> >>
> >>    sleep(10);
> >>    if (rank==RANK_DEATH)
> >>      exit(1);   /* this rank exits without calling MPI_Finalize */
> >>    sleep(10);   /* the remaining ranks keep running */
> >>    MPI_Finalize();
> >>    return 0;
> >> }
> >>
> >> I have tested this on Open MPI 1.2.1 as well as the latest stable
> >> release, 1.2.3. I am on Linux x86_64.
> >>
> >> Is this a bug, or are there some flags I can use to force mpirun
> >> (or orted, or...) to kill the whole MPI program when this happens?
> >>
> >> If one of the application processes dies from a signal (I have
> >> tested SEGV and FPE) rather than just exiting, the whole application
> >> is indeed killed.
> >>
> >> Best regards
> >> Daniel Spångberg
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
