Alberto,

In master there is no such support (we had support for migration a
while back, but we have stripped it out). However, at UTK we developed a
fork of Open MPI, called ULFM (User Level Failure Mitigation), which
provides fault management capabilities. This fork provides support for
detecting failures and for handling them in the MPI layer.

I suggest you look at fault-tolerance.org for more info.

  George.


On Mon, Feb 27, 2017 at 11:23 AM, Alberto Ortiz <alberto.orti...@gmail.com>
wrote:

> Hi,
> I am interested in using Open MPI to manage the distribution on a MicroZed
> cluster. These MicroZed boards come with a Zynq device, which has a
> dual-core ARM Cortex-A9. One of the objectives of the project I am working
> on is resilience, so I am truly interested in the fault tolerance provided
> by Open MPI.
>
> What I want to know is whether there is any implementation of run-time
> migration. For instance, if I have an eight-board MicroZed cluster running
> an MPI job and I unplug the Ethernet cable of one board or reboot another,
> is there any support in Open MPI to detect these failures and migrate the
> ranks to other processors at run time?
>
> Thank you in advance,
> Alberto.
>
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
