Re: [OMPI users] fault tolerance in open mpi

Josh Hursey Wed, 23 Sep 2009 09:11:31 -0400

Unfortunately I cannot provide a precise time frame for availability at this point, but we are targeting the v1.5 release series. There is a handful of core developers working on this issue at the moment. Pieces of this work have already made it into the Open MPI development trunk. If you want to play around with what is available, try turning on the resilient mapper:

  -mca rmaps resilient

We will be sure to email the list once this work becomes more stable and available.


-- Josh

On Sep 18, 2009, at 2:56 AM, vipin kumar wrote:

Hi Josh,
It is good to hear that work is in progress towards resiliency in Open MPI. I have been waiting for this capability. I have almost finished my development work and am waiting for this so that I can test my programs. It would be good if you could tell me how long it will take to make Open MPI a resilient implementation. By resiliency I mean that abnormal termination, or intentionally killing a process, should not cause any other (parent or sibling) process to be terminated, given that the processes are connected.
Thanks.

Regards,
On Mon, Aug 3, 2009 at 8:37 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:
Task-farm or manager/worker recovery models typically depend on intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI implementation. William Gropp and Ewing Lusk have a paper entitled "Fault Tolerance in MPI Programs" that outlines how an application might take advantage of these features in order to recover from process failure.
However, these techniques strongly depend upon resilient MPI implementations, and behaviors that, some may argue, are non-standard. Unfortunately there are not many MPI implementations that are sufficiently resilient in the face of process failure to support failure in task-farm scenarios. Though Open MPI supports the current MPI 2.1 standard, it is not as resilient to process failure as it could be.
There are a number of people working on improving the resiliency of Open MPI in the face of network and process failure (including myself). We have started to move some of the resiliency work into the Open MPI trunk. Resiliency in Open MPI has been improving over the past few months, but I would not assess it as ready quite yet. Most of the work has focused on the runtime level (ORTE), and there are still some MPI level (OMPI) issues that need to be worked out.
With all of that being said, I would try some of the techniques presented in the Gropp/Lusk paper in your application. Then test it with Open MPI and let us know how it goes.
Best,
Josh


On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:

Is that kind of approach possible within an MPI framework? Perhaps a
grid approach would be better. More experienced people, speak up,
please?
(The reason I say that is that I too am interested in the solution of
that kind of problem, where an individual blade of a blade server
fails and correcting for that failure on the fly is better than taking
checkpoints and restarting the whole process excluding the failed
blade.)

Durga

On Mon, Aug 3, 2009 at 9:21 AM, jody <jody....@gmail.com> wrote:
Hi

I guess "task-farming" could give you a certain amount of the kind of
fault-tolerance you want
(i.e. a master process distributes tasks to idle slave processes;
however, this will only work if the slave processes don't need to
communicate with each other).

Jody
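
The bookkeeping behind the task-farming idea above can be sketched in plain C without any MPI calls. This is only a simulation under stated assumptions: the function name simulate_farm, the die_after array (how many tasks a worker completes before it "crashes", negative meaning never), and the owner array are all hypothetical. The point it illustrates is that a task held by a failed worker stays in the queue and goes to the next live worker, which is possible only because workers never talk to each other.

```c
#include <assert.h>

#define MAX_WORKERS 32

/* Simulated task farm (no MPI). die_after[w] is the number of tasks
 * worker w completes before it crashes; a negative value means it never
 * crashes. owner[t] is set to the worker that completed task t, or -1
 * if the task was never completed. Returns the number of completed tasks. */
int simulate_farm(int ntasks, int nworkers, const int *die_after, int *owner)
{
    int served[MAX_WORKERS] = {0};  /* tasks finished per worker */
    int alive[MAX_WORKERS];
    int live, completed = 0, w = 0;

    assert(nworkers >= 0 && nworkers <= MAX_WORKERS);

    for (int i = 0; i < nworkers; i++)
        alive[i] = 1;
    live = nworkers;
    for (int t = 0; t < ntasks; t++)
        owner[t] = -1;

    for (int t = 0; t < ntasks && live > 0; ) {
        while (!alive[w])               /* skip workers already lost */
            w = (w + 1) % nworkers;

        if (die_after[w] >= 0 && served[w] >= die_after[w]) {
            /* Worker w crashes while holding task t: drop the worker,
             * keep the task queued so another worker picks it up. */
            alive[w] = 0;
            live--;
            continue;
        }

        owner[t] = w;                   /* task t completed by worker w */
        served[w]++;
        completed++;
        t++;
        w = (w + 1) % nworkers;         /* round robin */
    }
    return completed;
}
```

For example, with two workers where worker 0 crashes after one task and worker 1 never crashes, all tasks still complete and every task after the first ends up owned by worker 1.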
On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar <vipinkuma...@gmail.com> wrote:
Hi all,

Thanks Durga for your reply.
Jeff, you once wrote code for the Mandelbrot set to demonstrate fault tolerance
in LAM-MPI, i.e. killing any slave process doesn't
affect the others. That is exactly the behaviour I am looking for in Open MPI. I attempted it, but had no luck. Can you please tell me how to write such programs in Open MPI?
Thanks in advance.

Regards,
On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury <dpcho...@gmail.com> wrote:
Although I have perhaps the least experience on the topic in this
list, I will take a shot; more experienced people, please correct me:

The MPI standard specifies communication mechanisms, not fault tolerance at
any level. You may achieve network fault tolerance at the IP level by
implementing 'equal cost multipath' routes (which means two equally
capable NICs connecting to the same destination, with the kernel
routing table modified to use both cards; the kernel will dynamically
load balance). At the MAC level, you can achieve the same effect by
trunking multiple network cards.

You can achieve process-level fault tolerance with a checkpointing
scheme such as BLCR, which has been tested to work with Open MPI (and
other processes as well).

Durga
On Thu, Jul 9, 2009 at 4:57 AM, vipin kumar <vipinkuma...@gmail.com> wrote:
Hi all,

I want to know whether Open MPI supports network and process fault
tolerance or not. If there is any example demonstrating these features,
that would be best.

Regards,
--
Vipin K.
Research Engineer,
C-DOTB, India

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

