Alberto,

The master branch has no such support (we supported migration a while back, but have since stripped it out). However, at UTK we developed a fork of Open MPI, called ULFM, which provides fault management capabilities. This fork can detect failures and supports handling the fault in the MPI layer.
I suggest you look at fault-tolerance.org for more info.

George.

On Mon, Feb 27, 2017 at 11:23 AM, Alberto Ortiz <alberto.orti...@gmail.com> wrote:

> Hi,
> I am interested in using OpenMPI to manage the distribution on a MicroZed
> cluster. These MicroZed boards come with a Zynq device, which has a
> dual-core ARM Cortex-A9. One of the objectives of the project I am working
> on is resilience, so I am truly interested in the fault tolerance provided
> by OpenMPI.
>
> What I want to know is whether there is any implementation of run-time
> migration. For instance, if I have an octa-MicroZed cluster running an MPI
> job and I unplug the Ethernet cable of one of them or I reboot another one,
> is there any support in OpenMPI to detect these failures and migrate the
> ranks to other processors at run time?
>
> Thank you in advance,
> Alberto.
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users