Disconnect is a -collective- operation: both parent and child have to call it. Your child process is "hanging" because it is waiting for the parent to make the matching call.
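To make the "collective" point concrete, here is a small plain-Python model of the behaviour. It is not MPI code and does not use Open MPI. It only assumes that a two-party collective behaves like a rendezvous, which is modelled here with threading.Barrier. The names collective_disconnect and run_scenario are made up for this illustration, and the timeout merely stands in for "hangs forever".

```python
import threading


def collective_disconnect(barrier, timeout):
    """Model of a collective call: return True once every party has arrived,
    or False if the others never show up (in real MPI this would be a hang)."""
    try:
        barrier.wait(timeout=timeout)
        return True
    except threading.BrokenBarrierError:
        # The wait timed out because the other party never made the call.
        return False


def run_scenario(parent_calls_disconnect, timeout=0.2):
    """Run a hypothetical parent/child pair and return the child's result."""
    barrier = threading.Barrier(2)  # two parties: parent and child
    result = {}

    def child():
        # The child always calls the collective, as in the question.
        result["child"] = collective_disconnect(barrier, timeout)

    def parent():
        if parent_calls_disconnect:
            result["parent"] = collective_disconnect(barrier, timeout)
        # Otherwise the parent carries on and never makes the matching call.

    threads = [threading.Thread(target=child), threading.Thread(target=parent)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["child"]


if __name__ == "__main__":
    print("child only:", run_scenario(parent_calls_disconnect=False))
    print("both sides:", run_scenario(parent_calls_disconnect=True))
```

When only the child makes the call, the model reports False for the child (the stand-in for a hang). When both sides make it, the child returns. In the real program the equivalent fix would presumably be for the parent to call Disconnect on the intercommunicator it got back from the spawn call, while the child calls it on its parent communicator.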
On Dec 21, 2009, at 1:37 AM, vipin kumar wrote:

> Hello folks,
>
> As I explained my problem earlier, I am looking for fault tolerance in MPI
> programs. I read in the MPI 2.1 standard document that two DISCONNECTED
> processes do not affect each other, i.e. they can die or be killed
> without affecting other processes.
>
> So, I was trying to achieve fault tolerance by using
> MPI::Comm::Disconnect() to disconnect the CHILD process, which was spawned
> by calling MPI::Comm::spawn(), from the PARENT process. I am calling
> MPI::Comm::Disconnect() from the CHILD process immediately after calling
> MPI::Init(). It seems that the CHILD process is not returning from this call.
>
> I tried MPI::Comm::Free() too, but this is also not working. The process does
> not progress past this call. If I comment out these statements,
> everything works fine. Note that I have tried this on Solaris as well as on
> Linux (Fedora Core).
>
> My question is whether Open MPI supports disconnecting two processes (like
> a child from its parent). And if it does, then how?
>
> Thanks & Regards,
>
> On Wed, Sep 23, 2009 at 6:41 PM, Josh Hursey <[email protected]> wrote:
> Unfortunately I cannot provide a precise time frame for availability at this
> point, but we are targeting the v1.5 release series. There is a handful of
> core developers working on this issue at the moment. Pieces of this work
> have already made it into the Open MPI development trunk. If you want to play
> around with what is available, try turning on the resilient mapper:
>   -mca rmaps resilient
>
> We will be sure to email the list once this work becomes more stable and
> available.
>
> -- Josh
>
> On Sep 18, 2009, at 2:56 AM, vipin kumar wrote:
>
> Hi Josh,
>
> It is good to hear from you that work is in progress towards resiliency of
> Open MPI. I was and I am waiting for this capability in Open MPI. I have
> almost finished my development work and am waiting for this to happen so that I
> can test my programs.
> It would be good if you could tell me how long it will take
> to make Open MPI a resilient implementation. Here by resiliency I mean that
> abnormal termination or intentional killing of a process should not cause
> any (parent or sibling) process to be terminated, given that the processes are
> connected.
>
> Thanks.
>
> Regards,
>
> On Mon, Aug 3, 2009 at 8:37 PM, Josh Hursey <[email protected]> wrote:
> Task-farm or manager/worker recovery models typically depend on
> intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI
> implementation. William Gropp and Ewing Lusk have a paper entitled "Fault
> Tolerance in MPI Programs" that outlines how an application might take
> advantage of these features in order to recover from process failure.
>
> However, these techniques strongly depend upon resilient MPI implementations,
> and behaviors that, some may argue, are non-standard. Unfortunately there are
> not many MPI implementations that are sufficiently resilient in the face of
> process failure to support failure in task-farm scenarios. Though Open MPI
> supports the current MPI 2.1 standard, it is not as resilient to process
> failure as it could be.
>
> There are a number of people working on improving the resiliency of Open MPI
> in the face of network and process failure (including myself). We have
> started to move some of the resiliency work into the Open MPI trunk.
> Resiliency in Open MPI has been improving over the past few months, but I
> would not assess it as ready quite yet. Most of the work has focused on the
> runtime level (ORTE), and there are still some MPI level (OMPI) issues that
> need to be worked out.
>
> With all of that being said, I would try some of the techniques presented in
> the Gropp/Lusk paper in your application. Then test it with Open MPI and let
> us know how it goes.
>
> Best,
> Josh
>
> On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:
>
> Is that kind of approach possible within an MPI framework?
> Perhaps a
> grid approach would be better. More experienced people, speak up,
> please?
> (The reason I say that is that I too am interested in the solution of
> that kind of problem, where an individual blade of a blade server
> fails and correcting for that failure on the fly is better than taking
> checkpoints and restarting the whole process excluding the failed
> blade.)
>
> Durga
>
> On Mon, Aug 3, 2009 at 9:21 AM, jody <[email protected]> wrote:
> Hi
>
> I guess "task-farming" could give you a certain amount of the kind of
> fault tolerance you want
> (i.e. a master process distributes tasks to idle slave processors;
> however, this will only work
> if the slave processes don't need to communicate with each other).
>
> Jody
>
> On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar <[email protected]> wrote:
> Hi all,
>
> Thanks Durga for your reply.
>
> Jeff, you once wrote code for the Mandelbrot set to demonstrate fault tolerance
> in LAM/MPI, i.e. killing any slave process doesn't
> affect the others. That is the exact behaviour I am looking for in Open MPI.
> I attempted it, but had no luck. Can you please tell me how to write such
> programs in Open MPI?
>
> Thanks in advance.
>
> Regards,
>
> On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury <[email protected]> wrote:
>
> Although I have perhaps the least experience on the topic in this
> list, I will take a shot; more experienced people, please correct me:
>
> MPI standards specify communication mechanisms, not fault tolerance at
> any level. You may achieve network fault tolerance at the IP level by
> implementing 'equal cost multipath' routes (which means two equally
> capable NIC cards connecting to the same destination and modifying the
> kernel routing table to use both cards; the kernel will dynamically
> load balance). At the MAC level, you can achieve the same effect by
> trunking multiple network cards.
>
> You can achieve process-level fault tolerance with a checkpointing
> scheme such as BLCR, which has been tested to work with Open MPI (and
> other processes as well).
>
> Durga
>
> On Thu, Jul 9, 2009 at 4:57 AM, vipin kumar <[email protected]> wrote:
>
> Hi all,
>
> I want to know whether Open MPI supports network and process fault tolerance
> or not. If there is any example demonstrating these features, that would be
> best.
>
> Regards,
> --
> Vipin K.
> Research Engineer,
> C-DOTB, India
>
> _______________________________________________
> users mailing list
> [email protected]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
