Hi, I am interested in using OpenMPI to manage the distribution of work on a MicroZed cluster. These MicroZed boards come with a Zynq device, which has a dual-core ARM Cortex-A9. One of the objectives of the project I am working on is resilience, so I am truly interested in the fault tolerance provided by OpenMPI.
What I want to know is whether there is any implementation of run-time migration. For instance, if I have an eight-board MicroZed cluster running an MPI job and I unplug the Ethernet cable from one board or reboot another, is there any support in OpenMPI to detect these failures and migrate the affected ranks to other processors at run time?

Thank you in advance,
Alberto
_______________________________________________ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users