On Wed, Jan 30, 2013 at 3:02 AM, Ralph Castain <r...@open-mpi.org> wrote:

>
> If your node hardware is the problem, or you decide you do want/need to
> pursue an FT solution, then you might look at the OMPI-based solutions from
> parties such as http://fault-tolerance.org or the MPICH2 folks.
>

As Ralph said, you may want to look into alternatives. From what I have seen,
MPICH2 provides fault tolerance using BLCR. The same goes for Intel's MPI
(http://software.intel.com/en-us/forums/topic/296300). Though it is not free,
you can try it during a 30-day evaluation period
(http://software.intel.com/en-us/intel-mpi-library/).
It would be interesting to see how the two MPI implementations fare with
respect to BLCR-based FT.

Another alternative that may be worth considering is DMTCP
(http://dmtcp.sourceforge.net/) from Northeastern University, which was
the subject of an interesting podcast recently
(http://www.rce-cast.com/Podcast/rce-76-distributed-multithreaded-checkpointing.html)
:-)

Finally, depending on the application, you may be interested in adding
checkpoint-based fault tolerance at the application level with the help of
libraries such as SCR (http://sourceforge.net/projects/scalablecr/). Though
you'll need to spend some time modifying the application source code,
it may work out better than the system-level alternatives in the long run.
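
To give a rough idea of what "application level" means here, below is a
minimal sketch of the pattern in plain Python. It does not use SCR's API or
MPI at all; the file location, the function names, and the toy computation
are my own assumptions, purely for illustration. The application saves its
own state every few steps and, on startup, resumes from the last saved state
if one exists.

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint location, chosen only for this sketch.
CKPT_PATH = os.path.join(tempfile.gettempdir(), "app_state.ckpt")


def save_checkpoint(state, path=CKPT_PATH):
    """Write state to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic, so a crash mid-write leaves the old checkpoint intact.
    os.replace(tmp, path)


def load_checkpoint(path=CKPT_PATH):
    """Return the last saved state, or None if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run(total_steps, ckpt_every, path=CKPT_PATH, fail_at=None):
    """Toy iterative computation that resumes from the last checkpoint.

    fail_at simulates a crash at the given step (assumption, for demonstration).
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]  # the "real work" of one iteration
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            save_checkpoint(state, path)
    return state["acc"]
```

If a run dies partway through, calling run() again with the same path picks
up from the last saved step instead of starting over. In a real MPI
application every rank would have to save its part of the state at a
consistent point, which is the kind of bookkeeping a checkpointing library
is meant to help with.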

--
Constantinos
