Unfortunately, I cannot provide a precise time frame for availability at this point, but we are targeting the v1.5 release series. A handful of core developers are working on this issue at the moment, and pieces of this work have already made it into the Open MPI development trunk. If you want to play around with what is available, try turning on the resilient mapper:
  -mca rmaps resilient

We will be sure to email the list once this work becomes more stable and available.

-- Josh

On Sep 18, 2009, at 2:56 AM, vipin kumar wrote:

Hi Josh,

It is good to hear that work on the resiliency of Open MPI is in progress. I have been waiting for this capability. I have almost finished my development work and am waiting for it so that I can test my programs. It would be helpful if you could tell me how long it will take to make Open MPI a resilient implementation. Here, by resiliency I mean that abnormal termination or intentional killing of a process should not cause any other (parent or sibling) process to be terminated, given that the processes are connected.

thanks.

Regards,

On Mon, Aug 3, 2009 at 8:37 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:

Task-farm or manager/worker recovery models typically depend on intercommunicators (i.e., from MPI_Comm_spawn) and a resilient MPI implementation. William Gropp and Ewing Lusk have a paper entitled "Fault Tolerance in MPI Programs" that outlines how an application might take advantage of these features in order to recover from process failure.

However, these techniques strongly depend upon resilient MPI implementations, and behaviors that, some may argue, are non-standard. Unfortunately there are not many MPI implementations that are sufficiently resilient in the face of process failure to support failure in task-farm scenarios. Though Open MPI supports the current MPI 2.1 standard, it is not as resilient to process failure as it could be.

There are a number of people working on improving the resiliency of Open MPI in the face of network and process failure (including myself). We have started to move some of the resiliency work into the Open MPI trunk. Resiliency in Open MPI has been improving over the past few months, but I would not assess it as ready quite yet. Most of the work has focused on the runtime level (ORTE), and there are still some MPI level (OMPI) issues that need to be worked out.

With all of that being said, I would try some of the techniques presented in the Gropp/Lusk paper in your application. Then test it with Open MPI and let us know how it goes.

Best,
Josh


On Aug 3, 2009, at 10:30 AM, Durga Choudhury wrote:

Is that kind of approach possible within an MPI framework? Perhaps a
grid approach would be better. More experienced people, speak up,
please?
(The reason I say that is that I too am interested in a solution to
that kind of problem, where an individual blade of a blade server
fails and correcting for that failure on the fly is better than taking
checkpoints and restarting the whole process excluding the failed
blade.)

Durga

On Mon, Aug 3, 2009 at 9:21 AM, jody <jody....@gmail.com> wrote:
Hi

I guess "task-farming" could give you a certain amount of the kind of
fault-tolerance you want
(i.e. a master process distributes tasks to idle slave processes;
however, this will only work if the slave processes don't need to
communicate with each other).
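
A minimal sketch of this master/worker pattern, using plain Python subprocesses instead of MPI ranks so that it can run anywhere. The names `run_farm` and `WORKER_SRC`, the squaring task, and the "crash" switch that makes one worker die on its first task are all assumptions made for illustration; nothing here is Open MPI API. The master hands out one task at a time and, when a worker's pipe closes without a reply, requeues that task for the surviving workers:

```python
import subprocess
import sys
from collections import deque

# Worker program: reads one integer per line and replies with its square.
# Hypothetical failure injection: a worker started with "crash" as its
# argument exits abruptly on its first task, imitating a killed process.
WORKER_SRC = r"""
import os, sys
crash = sys.argv[1] == "crash"
for line in sys.stdin:
    if crash:
        os._exit(1)  # die without replying
    print(int(line) ** 2, flush=True)
"""


def run_farm(tasks, n_workers=3, crashing=(0,)):
    """Distribute tasks over workers; requeue a task when its worker dies."""
    workers = []
    for i in range(n_workers):
        mode = "crash" if i in crashing else "ok"
        workers.append(subprocess.Popen(
            [sys.executable, "-c", WORKER_SRC, mode],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True))

    pending = deque(tasks)
    results = {}
    while pending:
        if not workers:
            raise RuntimeError("all workers failed")
        task = pending.popleft()
        w = workers.pop(0)
        try:
            w.stdin.write(f"{task}\n")
            w.stdin.flush()
            reply = w.stdout.readline()  # "" means the pipe closed
        except OSError:
            reply = ""
        if reply:
            results[task] = int(reply)
            workers.append(w)     # healthy worker stays in rotation
        else:
            w.wait()              # reap the dead worker
            pending.append(task)  # hand the task to another worker

    # Closing stdin lets the remaining workers see EOF and exit cleanly.
    for w in workers:
        w.stdin.close()
        w.wait()
    return results
```

Calling `run_farm(range(6))` should still return all six squares even though worker 0 dies, because no worker depends on any other. If the workers had to exchange data among themselves, losing one would stall the rest, which is the limitation noted above.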

Jody


On Mon, Aug 3, 2009 at 1:24 PM, vipin kumar <vipinkuma...@gmail.com> wrote:
Hi all,

Thanks Durga for your reply.

Jeff, you once wrote a Mandelbrot set program to demonstrate fault tolerance
in LAM/MPI, i.e. killing any slave process does not affect the others. That is
exactly the behaviour I am looking for in Open MPI. I attempted it, but had no
luck. Can you please tell me how to write such programs in Open MPI?

Thanks in advance.

Regards,
On Thu, Jul 9, 2009 at 8:30 PM, Durga Choudhury <dpcho...@gmail.com> wrote:

Although I have perhaps the least experience on the topic in this
list, I will take a shot; more experienced people, please correct me:

The MPI standard specifies communication mechanisms, not fault tolerance at
any level. You may achieve network fault tolerance at the IP level by
implementing 'equal cost multipath' routes (which means two equally
capable NICs connecting to the same destination and modifying the
kernel routing table to use both cards; the kernel will dynamically
load balance). At the MAC level, you can achieve the same effect by
trunking multiple network cards.

You can achieve process-level fault tolerance with a checkpointing
scheme such as BLCR, which has been tested to work with Open MPI (and
other processes as well).

Durga

On Thu, Jul 9, 2009 at 4:57 AM, vipin kumar <vipinkuma...@gmail.com> wrote:

Hi all,

I want to know whether Open MPI supports network and process fault
tolerance. If there is an example demonstrating these features, that
would be best.

Regards,
--
Vipin K.
Research Engineer,
C-DOTB, India

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

