Open MPI's fault tolerance support is fairly rudimentary. If you kill any process without calling MPI_Finalize, Open MPI will -- by default -- kill all the others in the job.
Various research work is ongoing to improve fault tolerance in Open MPI, but I don't know the state of it in terms of surviving a failed process. I *think* that this kind of stuff is not ready for prime time, but I admit that this is not an area that I pay close attention to.

On Jun 23, 2010, at 3:08 AM, Randolph Pullen wrote:

> That is effectively what I have done by changing to the immediate
> send/receive and waiting in a loop a finite number of times for the
> transfers to complete -- and calling MPI_Abort if they do not complete
> in a set time. It is not clear how I can kill mpirun in a manner
> consistent with the API. Are you implying I should call exit() rather
> than MPI_Abort?
>
> --- On Wed, 23/6/10, David Zhang <solarbik...@gmail.com> wrote:
>
> From: David Zhang <solarbik...@gmail.com>
> Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
> To: "Open MPI Users" <us...@open-mpi.org>
> Received: Wednesday, 23 June, 2010, 4:37 PM
>
> Since you turned the machine off instead of just killing one of the
> processes, no signals could be sent to the other processes. Perhaps you
> could institute some sort of handshaking in your software that
> periodically checks for the attendance of all machines, and times out
> if not all are present within some allotted time?
>
> On Tue, Jun 22, 2010 at 10:43 PM, Randolph Pullen
> <randolph_pul...@yahoo.com.au> wrote:
>
> I have an MPI program that aggregates data from multiple SQL systems.
> It all runs fine. To test fault tolerance, I switch one of the machines
> off while it is running. The result is always a hang, i.e. mpirun never
> completes.
>
> To try to avoid this I have replaced the send and receive calls with
> immediate calls (i.e. MPI_Isend, MPI_Irecv) to try to trap long-waiting
> sends and receives, but it makes no difference. My requirement is that
> all processes complete, or mpirun exits with an error -- no matter
> where they are in their execution when a failure occurs.
> This system must continue (i.e. fail) if a machine dies, regroup, and
> re-cast the job over the remaining nodes.
>
> I am running FC10, gcc 4.3.2 and Open MPI 1.4.1 on dual-core Intel
> x86_64 machines with 4 GB RAM.
>
> ======================================================================
> The commands I have tried:
>
> mpirun -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
>
> mpirun -mca btl ^sm -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
>
> mpirun -mca orte_forward_job_control 1 -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
>
> ======================================================================
>
> The results:
>
> recv returned 0 with status 0
> waited # 2000002 tiumes - now status is 0 flag is -1976147192
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 5.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 29141 on
> node bd01 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
>
> [*** wait a long time ***]
>
> [bd01:29136] [[55293,0],0]-[[55293,0],1] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
>
> ^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly
> terminate
>
> ======================================================================
>
> As you can see, my trap can signal an abort and the TCP layer can time
> out, but mpirun just keeps on running...
>
> Any help greatly appreciated.
> Vlad

> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

> --
> David Zhang
> University of California, San Diego

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
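For readers who find this thread later: the timeout-and-abort pattern described above (nonblocking transfers polled a bounded number of times, with MPI_Abort on timeout) might look roughly like the sketch below. The buffer size, tag, peer ranks, and poll limit are illustrative, not taken from the original program. Note that MPI_Test is used rather than MPI_Wait, since MPI_Wait would block forever on a peer whose node has died -- and, as this thread shows, even MPI_Abort may not fully tear the job down in that case.

```c
/* Sketch only: bounded polling on a nonblocking receive, aborting the
 * job if the peer never answers.  Counts, tags, and ranks are made up. */
#include <mpi.h>
#include <stdio.h>

#define BUF_LEN   1024
#define MAX_POLLS 2000000L   /* give up after this many MPI_Test calls */

int main(int argc, char **argv)
{
    int rank, size, flag = 0;
    double buf[BUF_LEN];
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0 && size > 1) {
        /* Post a nonblocking receive from rank 1 instead of a blocking
         * MPI_Recv, so we retain control while waiting. */
        MPI_Irecv(buf, BUF_LEN, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);

        /* Poll a finite number of times rather than blocking forever. */
        for (long i = 0; i < MAX_POLLS && !flag; i++)
            MPI_Test(&req, &flag, &status);

        if (!flag) {
            /* The peer never answered: ask the runtime to kill the whole
             * job.  If the peer's node has actually died, mpirun may still
             * hang until its own TCP connections time out -- the behavior
             * reported in this thread. */
            fprintf(stderr, "rank 0: transfer timed out, aborting\n");
            MPI_Abort(MPI_COMM_WORLD, 5);
        }
    } else if (rank == 1) {
        MPI_Send(buf, BUF_LEN, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Run under mpirun with two or more processes (e.g. `mpirun -np 2 ./a.out`); this only detects an unresponsive peer, it does not let the survivors regroup and continue.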