I opened https://svn.open-mpi.org/trac/ompi/ticket/1144 to track this
issue.
On Aug 20, 2007, at 9:04 AM, Daniel Spångberg wrote:
Dear Sven,
I thought about doing that and experimented a bit as well, but there are
some problems then: I need to relink the user's code, registering an atexit
function is tricky from the Fortran code, and I still need to know whether
MPI_Finalize (and, as it turns out, MPI_Init as well; otherwise there are
problems with things like call system) has been called before my atexit
routine is called...
Best regards
Daniel
On Mon, 20 Aug 2007 14:37:44 +0200, Sven Stork <st...@hlrs.de> wrote:
Instead of doing dirty tricks with the library, you could try to register a
cleanup function with atexit.
Thanks,
Sven
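
A rough illustration of the atexit idea, not code from this thread: the handler below asks MPI whether the process is between MPI_Init and MPI_Finalize and, if so, calls MPI_Abort. MPI_Initialized and MPI_Finalized are standard MPI calls; everything else is an assumption made for the sketch. In particular, the USE_REAL_MPI macro and the fake_* stand-ins are hypothetical and exist only so the file compiles without an MPI installation, and whether MPI_Abort can safely be called from an exit handler depends on the MPI implementation. Handlers registered with atexit do not run when the process calls _exit.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_REAL_MPI   /* hypothetical macro: define it when building with mpicc */
#include <mpi.h>
#else
/* Stand-ins (assumptions) so the sketch compiles without an MPI installation. */
#define MPI_COMM_WORLD 0
static int fake_initialized = 0, fake_finalized = 0, fake_aborted = 0;
static int MPI_Initialized(int *flag) { *flag = fake_initialized; return 0; }
static int MPI_Finalized(int *flag)   { *flag = fake_finalized;   return 0; }
static int MPI_Abort(int comm, int code)
{
    (void)comm; (void)code;
    fake_aborted = 1;      /* the real call would terminate the job */
    return 0;
}
#endif

/* Returns 1 when the process is between MPI_Init and MPI_Finalize. */
static int mpi_is_active(void)
{
    int initialized = 0, finalized = 0;
    MPI_Initialized(&initialized);
    MPI_Finalized(&finalized);
    return initialized && !finalized;
}

/* Exit handler: abort the whole job if this process exits mid-run. */
static void abort_if_mpi_active(void)
{
    if (mpi_is_active()) {
        fprintf(stderr, "exit() called before MPI_Finalize; aborting job\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}

/* Register the handler; returns 0 on success, like atexit itself. */
static int install_exit_guard(void)
{
    return atexit(abort_if_mpi_active);
}
```

In a real program, install_exit_guard() would be called once, for example right after MPI_Init. Because the handler queries MPI itself, it does not need to track separately whether MPI_Init or MPI_Finalize has already been called.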
On Friday 17 August 2007 19:59, Daniel Spångberg wrote:
Dear George,
I think that the best way is to call MPI_Abort. However, this forces the
user to modify the code, which I have already suggested. But their
application is not calling exit directly; I merely wrote the simplest code
that demonstrates the problem. Their application is a Fortran program, and
during file IO, when something bad happens, the Fortran runtime (PGI)
calls exit (and sometimes _exit for some reason). The file IO is only done
in one process. I have told them to try to add ERR=lineno,END=lineno,
where the code at lineno calls MPI_Abort. This has not happened yet.
Nevertheless, openmpi does not terminate the application when one of the
processes exits without MPI_Finalize, contrary to the content of the mpirun
man page. I have currently "solved" the problem by writing a .so that is
LD_PRELOAD:ed, checking whether MPI_Finalize is indeed called between
MPI_Init and exit/_exit. I'd rather not keep this "solution" for too long.
If it is indeed so that the mpirun man page is wrong and the code right,
I'd rather push the proper error-handling solution.
Best regards
Daniel Spångberg
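
The preloaded library described above is not shown in the thread. The following is only a guess at one way such a shim could look, using the standard PMPI profiling entry points to interpose MPI_Init and MPI_Finalize and an exit handler to catch an exit in between. The shim_* names, the USE_REAL_MPI macro, and the stub functions are hypothetical; the stubs exist only so the file compiles without an MPI installation. It is also an assumption that the Fortran bindings reach these C entry points, and the sketch does not catch _exit, which bypasses exit handlers.

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_REAL_MPI   /* hypothetical macro: define it when building with mpicc */
#include <mpi.h>
#else
/* Stand-ins (assumptions) so the sketch compiles without an MPI installation. */
#define MPI_COMM_WORLD 0
#define MPI_SUCCESS 0
static int stub_abort_calls = 0;
static int PMPI_Init(int *argc, char ***argv) { (void)argc; (void)argv; return MPI_SUCCESS; }
static int PMPI_Finalize(void) { return MPI_SUCCESS; }
static int PMPI_Abort(int comm, int code)
{
    (void)comm; (void)code;
    stub_abort_calls++;    /* the real call would terminate the job */
    return MPI_SUCCESS;
}
#endif

static int shim_in_mpi = 0;       /* 1 between MPI_Init and MPI_Finalize */
static int shim_handler_set = 0;  /* register the exit handler only once */

/* Runs on exit(); it does not run when the process calls _exit(). */
static void shim_on_exit(void)
{
    if (shim_in_mpi) {
        fprintf(stderr, "shim: exit() without MPI_Finalize, calling MPI_Abort\n");
        PMPI_Abort(MPI_COMM_WORLD, 1);
    }
}

/* Interposed MPI_Init: forward to the real routine, then start tracking. */
int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    if (rc == MPI_SUCCESS) {
        shim_in_mpi = 1;
        if (!shim_handler_set && atexit(shim_on_exit) == 0)
            shim_handler_set = 1;
    }
    return rc;
}

/* Interposed MPI_Finalize: stop tracking, then forward to the real routine. */
int MPI_Finalize(void)
{
    shim_in_mpi = 0;
    return PMPI_Finalize();
}
```

With a real MPI installation, something like this would be built as a shared object (for example with mpicc -DUSE_REAL_MPI -shared -fPIC) and named in LD_PRELOAD for each rank, so the user's code needs neither changes nor relinking.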
On Fri, 17 Aug 2007 18:25:17 +0200, George Bosilca <bosi...@eecs.utk.edu> wrote:
The MPI standard states that the correct way to abort/kill an MPI
application is to use the MPI_Abort function. Unless you're doing some
kind of fault-tolerance work, there is no reason to end one of your MPI
processes via exit.
Thanks,
george.
On Aug 16, 2007, at 12:04 PM, Daniel Spångberg wrote:
Dear Open-MPI user list members,
I currently have a user with an application where one of the MPI
processes dies, but the openmpi system does not kill the rest of the
application.
Since the mpirun man page states the following, I would expect it to take
care of killing the application if a process exits without calling
MPI_Finalize:
Process Termination / Signal Handling
    During the run of an MPI application, if any rank dies abnormally
    (either exiting before invoking MPI_FINALIZE, or dying as the
    result of a signal), mpirun will print out an error message and
    kill the rest of the MPI application.
The following test program demonstrates the behaviour (the program hangs
until it is killed by the user or batch system):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

#define RANK_DEATH 1   /* rank that exits without calling MPI_Finalize */

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    sleep(10);
    if (rank == RANK_DEATH)
        exit(1);       /* plain exit: no MPI_Abort, no MPI_Finalize */
    sleep(10);
    MPI_Finalize();
    return 0;
}
I have tested this on openmpi 1.2.1 as well as the latest stable 1.2.3. I
am on Linux x86_64.
Is this a bug, or are there some flags I can use to force mpirun (or
orted, or...) to kill the whole MPI program when this happens?
If one of the application processes dies from a signal (I have tested SEGV
and FPE) rather than just exiting, the whole application is indeed killed.
Best regards
Daniel Spångberg
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems