Instead of doing something dirty with the library, you could try to register a cleanup function with atexit().
Thanks,
Sven

On Friday 17 August 2007 19:59, Daniel Spångberg wrote:
> Dear George,
>
> I think that the best way is to call MPI_Abort. However, this forces the
> user to modify the code, which I have already suggested. But their
> application is not calling exit directly; I merely wrote the simplest code
> that demonstrates the problem. Their application is a Fortran program and
> during file IO, when something bad happens, the Fortran runtime (pgi)
> calls exit (and sometimes _exit for some reason). The file IO is only done
> in one process. I have told them to try to add ERR=lineno,END=lineno,
> where the code at lineno calls MPI_Abort. This has not happened yet.
> Nevertheless, openmpi does not terminate the application when one of
> the processes exits without MPI_Finalize, contrary to the content of the
> mpirun man-page. I have currently "solved" the problem by writing a .so
> that is LD_PRELOAD:ed, checking whether MPI_Finalize is indeed called
> between MPI_Init and exit/_exit. I'd rather not keep this "solution" for
> too long. If it is indeed so that the mpirun man-page is wrong and the
> code right, I'd rather push the proper error-handling solution.
>
> Best regards
> Daniel Spångberg
>
>
> On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosi...@eecs.utk.edu>
> wrote:
>
> > The MPI standard states that the correct way to abort/kill an MPI
> > application is using the MPI_Abort function. Unless you're doing
> > some kind of fault tolerance stuff, there is no reason to end one of
> > your MPI processes via exit.
> >
> > Thanks,
> > george.
> >
> > On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:
> >
> >> Dear Open-MPI user list members,
> >>
> >> I currently have a user with an application where one of the
> >> MPI-processes dies, but the openmpi-system does not kill the rest
> >> of the application.
> >>
> >> Since the mpirun man page states the following I would expect it to
> >> take care of killing the application if a process exits without
> >> calling MPI_Finalize:
> >>
> >>    Process Termination / Signal Handling
> >>        During the run of an MPI application, if any rank dies
> >>        abnormally (either exiting before invoking MPI_FINALIZE, or
> >>        dying as the result of a signal), mpirun will print out an
> >>        error message and kill the rest of the MPI application.
> >>
> >> The following test program demonstrates the behaviour (program
> >> hangs until it is killed by the user or batch system):
> >>
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >> #include <unistd.h>
> >> #include <mpi.h>
> >>
> >> #define RANK_DEATH 1
> >>
> >> int main(int argc, char **argv)
> >> {
> >>   int rank;
> >>   MPI_Init(&argc,&argv);
> >>   MPI_Comm_rank(MPI_COMM_WORLD,&rank);
> >>
> >>   sleep(10);
> >>   if (rank==RANK_DEATH)
> >>     exit(1);
> >>   sleep(10);
> >>   MPI_Finalize();
> >>   return 0;
> >> }
> >>
> >> I have tested this on openmpi 1.2.1 as well as the latest stable
> >> 1.2.3. I am on Linux x86_64.
> >>
> >> Is this a bug, or are there some flags I can use to force the
> >> mpirun (or orted, or...) to kill the whole MPI program when this
> >> happens?
> >>
> >> If one of the application processes dies from a signal (I have
> >> tested SEGV and FPE) rather than just exiting, the whole
> >> application is indeed killed.
> >>
> >> Best regards
> >> Daniel Spångberg
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users