How about changing the default error handler?
It is not supposed to work, and if you find an MPI implementation
that supports this approach, please tell me. I know the paper where you
read about this, but even with their MPI library this approach does
not work.
Soon, Open MPI will be able to support this feature. Several
fault-tolerant modes are under way, but there is no precise timeline yet.
Thanks,
george.
On Oct 26, 2006, at 10:19 AM, laurent.po...@fr.thalesgroup.com wrote:
Hi,
I developed a launcher application:
an MPI application (say main_exe) launches 2 MPI applications (say
exe1 and exe2), using MPI_Comm_spawn_multiple.
Now, I'm looking at the behavior when an exe crashes.
What I can see is the following:
1) when everybody is launched, I see the following processes,
using 'ps':
- the 'mpiexec -v -d -n 1 ./main_exe' command
- the orted server used for 'main_exe' (say 'orted1')
- main_exe
- the orted server used for 'exe1' and 'exe2' (say 'orted2')
- exe1
- exe2
2) I use kill -9 to 'crash' exe2
3) orted2 and exe1 finish.
4) with ps, I see that the following processes remain: mpiexec,
'orted1', main_exe
5) main_exe tries to send a message to exe1, using MPI_Bsend:
main_exe gets killed by a SIGPIPE signal!
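
The SIGPIPE in step 5 is the standard POSIX reaction to writing on a
pipe or socket whose other end is gone; its default action terminates
the process. The sketch below is plain POSIX C with no MPI involved and
shows that default being replaced, so that write() fails with EPIPE
instead. It assumes the MPI library was writing to a connection closed
by the dead processes, which the report suggests but does not confirm.
The name write_to_dead_peer is hypothetical.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper: write to a pipe whose read end is already closed.
 * Returns 0 if write() failed with EPIPE, -1 in any other case. */
int write_to_dead_peer(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* Without this line the write below would raise SIGPIPE, whose
     * default action terminates the process. */
    signal(SIGPIPE, SIG_IGN);

    close(fds[0]);                      /* the "peer" goes away */
    ssize_t n = write(fds[1], "x", 1);  /* fails instead of killing us */
    int saved_errno = errno;
    close(fds[1]);

    return (n == -1 && saved_errno == EPIPE) ? 0 : -1;
}
```

Whether ignoring SIGPIPE is enough for main_exe to keep running under
Open MPI 1.1.1 is not established here; the sketch only illustrates
where the signal comes from.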
So what I see is that when part of an MPI application crashes,
the whole application crashes!
Is there a way to get a different behavior? For example, MPI_Bsend
could return an error.
Some additional information:
- I work on Linux, with Open MPI 1.1.1.
- I'm developing in C and C++.
Thanks,
Laurent.