I'm afraid there is no way to do this with MPI applications in 1.3.2
(or in any released version of OMPI).
The OMPI trunk does provide continuous re-spawn of failed processes,
mapping them to other nodes and considering fault relationships
between nodes, but this only works if they are -not- MPI processes. I
can detail that for you, if you would like.
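
As a rough illustration of plain (non-MPI) re-spawning, and not the
mechanism the OMPI trunk uses, a supervisor can be sketched with POSIX
fork/waitpid. The names supervise, worker_fn, flaky_worker and
crashing_worker are invented for this sketch, and it re-spawns on the
same machine only; it makes no attempt to map processes to other nodes.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical worker signature: returns 0 on success, non-zero on failure. */
typedef int (*worker_fn)(int attempt);

/* Run `work` in a child process and re-spawn it while it fails, up to
 * max_restarts times. Returns the number of restarts that were needed,
 * or -1 if the worker never succeeded or fork/waitpid failed. */
int supervise(worker_fn work, int max_restarts)
{
    assert(work != NULL);
    for (int attempt = 0; attempt <= max_restarts; attempt++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(work(attempt) == 0 ? 0 : 1);  /* child: report via exit code */

        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return attempt;                     /* clean exit: stop supervising */
        /* non-zero exit or death by signal: loop around and re-spawn */
    }
    return -1;
}

/* Demo worker (illustrative): fails on the first two attempts. */
int flaky_worker(int attempt) { return attempt < 2 ? 1 : 0; }

/* Demo worker (illustrative): is killed by SIGABRT on every attempt. */
int crashing_worker(int attempt) { (void)attempt; abort(); }
```

With these demo workers, supervise(flaky_worker, 5) needs two restarts,
while supervise(crashing_worker, 1) gives up and returns -1. Nothing
here has to be re-synchronized after a restart, which is why this case
is so much easier than the MPI one.
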
The problem with MPI processes is that restart is a much larger
problem than just re-spawning a process. The entire MPI system becomes
out-of-sync when one process fails - messages in-flight can be lost,
collectives hang, etc.
Even if you rewire the connections after re-spawning the process, you
still have the problem of re-synchronizing the MPI communications -
recovering lost messages, determining if a collective is already in
operation and waiting for this process to respond, etc. Hence, our
default response is to simply terminate the job, letting the user
restart it from some prior checkpoint.
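
The "restart from a prior checkpoint" step can be illustrated at the
application level. This is a minimal sketch under stated assumptions,
not an Open MPI facility: struct app_state, save_checkpoint and
load_checkpoint are hypothetical names, and a real job would save
whatever state it needs to resume. The state is written to a temporary
file and then renamed, so a crash in the middle of a write does not
clobber the previous checkpoint.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical application state for this sketch. */
struct app_state {
    long iteration;
    double partial_sum;
};

/* Write the state to "<path>.tmp", then rename it over <path>.
 * Returns 0 on success, -1 on any failure. */
int save_checkpoint(const char *path, const struct app_state *s)
{
    assert(path != NULL && s != NULL);
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    FILE *f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    int ok = (fwrite(s, sizeof *s, 1, f) == 1);
    if (fclose(f) != 0)
        ok = 0;
    if (!ok) {
        remove(tmp);            /* do not leave a partial file behind */
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read the state back. Returns 0 on success, -1 if the checkpoint
 * is missing or truncated. */
int load_checkpoint(const char *path, struct app_state *s)
{
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    int ok = (fread(s, sizeof *s, 1, f) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

On startup the job would call load_checkpoint and fall back to a fresh
start if it returns -1; during the run it would call save_checkpoint
every so many iterations.
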
Of course, the issue of how to recover from a single process failure
remains the subject of considerable research. I assume you are
engaging in such research?
On Jun 2, 2009, at 10:49 PM, Tee Wen Kai wrote:
Hi,
I am writing a program for a central controller that spawns
processes depending on the user's selection. When a fault occurs in
the spawned processes (for example, the computer running a spawned
process suddenly goes down), the controller should react and respawn
the processes on the available machines.
However, when a computer goes down, all communication hangs. To
resolve this, the controller sends SIGTERM to kill the spawned
processes. In the spawned program, I have written a signal handler
for this case. However, when I call MPI_Finalize in the handler, the
processes print error messages on exit because some communication is
not complete. I therefore modified my program so that processes
exiting through the handler do not call MPI_Finalize. I am using
Open MPI 1.2.8 and this works. However, version 1.2.8 has other
bugs: for example, when processes spawned with MPI_Comm_spawn exit,
their orted services are not terminated, so many orted services
accumulate when processes are spawned over and over again. I
therefore started evaluating version 1.3.2. Version 1.3.2 fixes that
bug, but the whole program exits as soon as one process exits
without calling MPI_Finalize. I would therefore appreciate your help
or suggestions on how to overcome this, or on the proper way to quit
processes when they are stuck because one process is down.
Thank you.
Regards,
Wenkai
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users