I am pleased to announce that Open MPI now supports checkpoint/restart
process fault tolerance. This new feature is available on the current
development trunk as of r14519 and is currently scheduled for release
in the version 1.3 series of Open MPI.
The current implementation includes support for fully coordinated
checkpoint/restart operation (somewhat similar to the LAM/MPI
implementation). We support checkpoint/restart with the Berkeley Lab
Checkpoint/Restart (BLCR) system, and a specialized SELF component
used to support application-level checkpoint/restart operations.
By default, checkpoint/restart process fault tolerance is compiled out
and disabled at runtime. For information on how to enable and
properly use this new feature, please refer to the Checkpoint/Restart
Users Guide draft attached to the Wiki page:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
In addition to the checkpoint/restart users guide, the Wiki entry
also describes the current status of and updates regarding the
development of this new feature.
If you have any questions or problems using checkpoint/restart
process fault tolerance in Open MPI, please send them to the users and
developers lists.
Cheers,
Josh
----
Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/