Re: [OMPI users] Exit Program Without Calling MPI_Finalize For Special Case

Ralph Castain Wed, 3 Jun 2009 05:19:48 -0400

I'm afraid there is no way to do this in 1.3.2 (or any OMPI distributed release) with MPI applications.

The OMPI trunk does provide continuous re-spawn of failed processes, mapping them to other nodes and considering fault relationships between nodes, but this only works if they are -not- MPI processes. I can detail that for you, if you would like.

The problem with MPI processes is that restart is a much larger problem than just re-spawning a process. The entire MPI system becomes out-of-sync when one process fails - messages in-flight can be lost, collectives hang, etc.

Even if you rewire the connections after re-spawning the process, you still have the problem of re-synchronizing the MPI communications - recovering lost messages, determining if a collective is already in operation and waiting for this process to respond, etc. Hence, our default response is to simply terminate the job, letting the user restart it from some prior checkpoint.
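To make "restart it from some prior checkpoint" concrete, here is a minimal sketch of application-level checkpointing in plain C. It is only an illustration under stated assumptions, not something Open MPI provides: the names app_state, save_checkpoint, load_checkpoint and run_job are hypothetical, the "work" is a stand-in sum, and the write-then-rename step assumes POSIX rename() semantics. In a real MPI job each rank would presumably write its own file, and all ranks would have to checkpoint at a consistent point.

```c
#include <stdio.h>

/* Hypothetical application state: a loop counter and one accumulated value. */
typedef struct {
    long step;
    double partial_sum;
} app_state;

/* Write the state to a temporary file, then rename it over the real
 * checkpoint, so a process killed mid-write leaves the previous
 * checkpoint intact (assumes POSIX rename(), which replaces the target). */
int save_checkpoint(const char *path, const app_state *s)
{
    char tmp[512];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || n >= (int)sizeof tmp)
        return -1;                       /* path too long for the buffer */

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    if (fclose(f) != 0 || written != 1) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Returns 0 and fills *s if a complete checkpoint exists, -1 otherwise.
 * *s is left untouched on failure. */
int load_checkpoint(const char *path, app_state *s)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    app_state tmp;
    size_t got = fread(&tmp, sizeof tmp, 1, f);
    fclose(f);
    if (got != 1)
        return -1;
    *s = tmp;
    return 0;
}

/* Run until total_steps, resuming from the checkpoint if there is one. */
double run_job(const char *path, long total_steps, long checkpoint_every)
{
    app_state s = {0, 0.0};
    (void)load_checkpoint(path, &s);     /* on failure s keeps its fresh-start values */

    while (s.step < total_steps) {
        s.partial_sum += (double)s.step; /* stand-in for one unit of real work */
        s.step++;
        if (s.step % checkpoint_every == 0)
            save_checkpoint(path, &s);   /* errors ignored in this sketch */
    }
    return s.partial_sum;
}
```

When the job is terminated and launched again, run_job picks up from the last saved step instead of step 0. The checkpoint interval is a trade-off between lost work and I/O cost.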

Of course, the issue of how to recover from a single process failure remains the subject of considerable research. I assume you are engaging in such research?


On Jun 2, 2009, at 10:49 PM, Tee Wen Kai wrote:

Hi,
I am writing a program for a central controller that spawns processes depending on the user's selection. When a fault occurs in the spawned processes, for example when the computer running a spawned process suddenly goes down, the controller should react and respawn the processes on available machines. However, when a computer goes down, all communication hangs. To resolve this, the controller sends a SIGTERM signal to kill the spawned processes. In the spawned program, I have written a signal handler to handle such cases. However, when I include MPI_Finalize in the handler, there are error messages when the processes exit because some communication is not complete. So I modified my program so that when the processes exit through the handler, MPI_Finalize is not called. I am using openmpi 1.2.8 and this works. However, version 1.2.8 has other bugs: for example, when processes spawned with MPI_Comm_spawn exit, the orted services are not terminated, which leaves a lot of orted services running when processes are spawned over and over again. So I started evaluating version 1.3.2. Version 1.3.2 fixes that bug, but the whole program exits once a process exits without calling MPI_Finalize. I would therefore like your help or suggestions on how to overcome this, or on the proper way to quit processes when they are stuck because one process is down.
Thank you.

Regards,
Wenkai

