I opened https://svn.open-mpi.org/trac/ompi/ticket/1144 to track this issue.

On Aug 20, 2007, at 9:04 AM, Daniel Spångberg wrote:

Dear Sven,

I thought about doing that and experimented a bit as well, but there are
some problems: I need to relink the user's code, registering an atexit
function is tricky from the Fortran code, and I still need to know whether
MPI_Finalize (and, as it turns out, MPI_Init as well, otherwise there are
problems with things like call system) has been called before my atexit
routine is called...

Best regards
Daniel
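
For illustration, the bookkeeping described above (knowing whether MPI_Init and MPI_Finalize have been called by the time an atexit routine runs) might be sketched as follows. This is only a sketch under assumptions: the names note_mpi_init, note_mpi_finalize, check_at_exit and install_exit_check are hypothetical, the flags are set by hand rather than by real MPI calls, and the handler only records its verdict instead of calling MPI_Abort. A real program could instead query the standard MPI_Initialized() and MPI_Finalized() functions.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Where the process stands relative to MPI (hypothetical names). */
typedef enum { PHASE_NOT_STARTED, PHASE_RUNNING, PHASE_DONE } mpi_phase;

/* Classify the state from two "has it been called?" flags. */
static mpi_phase classify_phase(bool initialized, bool finalized)
{
    if (!initialized)
        return PHASE_NOT_STARTED;
    return finalized ? PHASE_DONE : PHASE_RUNNING;
}

/* Hand-maintained flags; stand-ins for MPI_Initialized()/MPI_Finalized(). */
static bool g_initialized;
static bool g_finalized;

/* Set by the exit handler when it decides the job should be torn down. */
static bool g_would_abort;

/* To be called right after MPI_Init and MPI_Finalize respectively. */
static void note_mpi_init(void)     { g_initialized = true; }
static void note_mpi_finalize(void) { g_finalized = true; }

/* Exit handler: only an exit between init and finalize is abnormal. */
static void check_at_exit(void)
{
    if (classify_phase(g_initialized, g_finalized) == PHASE_RUNNING) {
        /* Assumption: real code would call MPI_Abort(MPI_COMM_WORLD, 1) here. */
        g_would_abort = true;
    }
}

/* Register the handler; returns 0 on success, like atexit(). */
static int install_exit_check(void)
{
    return atexit(check_at_exit);
}
```

Note that atexit handlers run on exit but not on _exit, so a sketch like this would not catch the case where the runtime calls _exit. Whether it is safe to call MPI_Abort from inside an exit handler is not something this thread settles.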

On Mon, 20 Aug 2007 14:37:44 +0200, Sven Stork <st...@hlrs.de> wrote:

Instead of playing dirty tricks with the library, you could try to
register a cleanup function with atexit.

Thanks,
  Sven

On Friday 17 August 2007 19:59, Daniel Spångberg wrote:

Dear George,

I think that the best way is to call MPI_Abort. However, this forces the
user to modify the code, which I have already suggested. But their
application does not call exit directly; I merely wrote the simplest code
that demonstrates the problem. Their application is a Fortran program, and
during file IO, when something bad happens, the Fortran runtime (PGI) calls
exit (and sometimes _exit, for some reason). The file IO is only done in
one process. I have told them to try adding ERR=lineno,END=lineno, where
the code at lineno calls MPI_Abort. This has not happened yet.
Nevertheless, Open MPI does not terminate the application when one of the
processes exits without calling MPI_Finalize, contrary to what the mpirun
man page says. I have currently "solved" the problem by writing a .so that
is LD_PRELOAD:ed and checks whether MPI_Finalize is indeed called between
MPI_Init and exit/_exit. I'd rather not keep this "solution" for too long.
If the mpirun man page is indeed wrong and the code right, I'd rather push
the proper error-handling solution.

Best regards
Daniel Spångberg


On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosi...@eecs.utk.edu> wrote:

The MPI standard states that the correct way to abort/kill an MPI
application is to use the MPI_Abort function. Unless you're doing some
kind of fault tolerance work, there is no reason to end one of your MPI
processes via exit.

   Thanks,
     george.

On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:

Dear Open-MPI user list members,

I currently have a user with an application where one of the MPI
processes dies, but the Open MPI system does not kill the rest of the
application.

Since the mpirun man page states the following, I would expect it to take
care of killing the application if a process exits without calling
MPI_Finalize:

    Process Termination / Signal Handling
        During the run of an MPI application, if any rank dies abnormally
        (either exiting before invoking MPI_FINALIZE, or dying as the
        result of a signal), mpirun will print out an error message and
        kill the rest of the MPI application.

The following test program demonstrates the behaviour (the program hangs
until it is killed by the user or batch system):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

#define RANK_DEATH 1

int main(int argc, char **argv)
{
   int rank;
   MPI_Init(&argc,&argv);
   MPI_Comm_rank(MPI_COMM_WORLD,&rank);

   sleep(10);
   if (rank==RANK_DEATH)
     exit(1);
   sleep(10);
   MPI_Finalize();
   return 0;
}

I have tested this on Open MPI 1.2.1 as well as the latest stable 1.2.3.
I am on Linux x86_64.

Is this a bug, or are there some flags I can use to force mpirun (or
orted, or...) to kill the whole MPI program when this happens?

If one of the application processes dies from a signal (I have tested SEGV
and FPE) rather than just exiting, the whole application is indeed killed.

Best regards
Daniel Spångberg
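
As background for the observation above (death by signal is handled, a plain exit is not), here is a minimal sketch of how a parent process can tell the two apart at the operating-system level, using POSIX fork and waitpid. It is not Open MPI's implementation, and the names child_end, classify_status and run_child are hypothetical. It also cannot detect a missing MPI_Finalize by itself: a process that calls exit(0) without finalizing looks clean here, so a launcher would need additional bookkeeping for that case.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* How a child process ended, as seen by its parent (hypothetical names). */
typedef enum {
    ENDED_CLEAN,         /* exited with status 0 */
    ENDED_NONZERO_EXIT,  /* exited with a non-zero status */
    ENDED_BY_SIGNAL,     /* terminated by a signal */
    ENDED_UNKNOWN        /* stopped, continued, or an error occurred */
} child_end;

/* Decode a status word filled in by waitpid(). */
static child_end classify_status(int status)
{
    if (WIFSIGNALED(status))
        return ENDED_BY_SIGNAL;
    if (WIFEXITED(status))
        return WEXITSTATUS(status) == 0 ? ENDED_CLEAN : ENDED_NONZERO_EXIT;
    return ENDED_UNKNOWN;
}

/*
 * Fork a child that either raises signal `sig` (when sig != 0) or exits
 * with `exit_code`, then wait for it and report how it ended.
 */
static child_end run_child(int exit_code, int sig)
{
    pid_t pid = fork();
    if (pid < 0)
        return ENDED_UNKNOWN;

    if (pid == 0) {
        /* Child: die by signal if asked, otherwise exit immediately. */
        if (sig != 0)
            raise(sig);
        _exit(exit_code);
    }

    /* Parent: collect the child's termination status. */
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return ENDED_UNKNOWN;
    return classify_status(status);
}
```

For example, run_child(1, 0) corresponds to the exit(1) case in the test program, and run_child(0, SIGKILL) to a rank that is killed by a signal.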
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems

