On Wed, Jan 30, 2013 at 3:02 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> If your node hardware is the problem, or you decide you do want/need to
> pursue an FT solution, then you might look at the OMPI-based solutions from
> parties such as http://fault-tolerance.org or the MPICH2 folks.
>

Just as Ralph said, you may want to look into alternatives. From what I have seen, MPICH2 provides fault tolerance using BLCR. The same goes for Intel's MPI (http://software.intel.com/en-us/forums/topic/296300). Though it is not free, you can try it during a 30-day evaluation period (http://software.intel.com/en-us/intel-mpi-library/). It would be interesting to see how the two MPI implementations fare with respect to BLCR-based FT.

Another alternative that may be worth considering is DMTCP (http://dmtcp.sourceforge.net/) from Northeastern University, about which there was an interesting podcast recently (http://www.rce-cast.com/Podcast/rce-76-distributed-multithreaded-checkpointing.html) :-)

Finally, depending on the application, you may be interested in adding checkpoint-based fault tolerance at the application level with the help of libraries such as SCR (http://sourceforge.net/projects/scalablecr/). Though you'll need to spend some time modifying the application source code, it may be better than system-level alternatives in the long run.

--
Constantinos
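
P.S. To give a rough idea of what application-level checkpointing involves, here is a minimal single-process sketch in Python. It does not use SCR or MPI, and none of it is SCR's API: the function names, the `CHECKPOINT_EVERY` interval and the `fail_at` parameter are all made up for illustration. The application saves its own state every so often and, on restart, resumes from the last saved state instead of from the beginning.

```python
import os
import pickle
import tempfile

CHECKPOINT_EVERY = 100  # hypothetical interval, in iterations


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, fail_at=None):
    """Sum the integers 0..total_steps-1, resuming from a checkpoint if present."""
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            # Stand-in for a node failure (illustration only).
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If the first run dies partway through, calling `run()` again with the same path picks up from the last checkpoint and loses at most `CHECKPOINT_EVERY` iterations of work. In a real MPI job every rank would have to save its own state at a consistent point, and that coordination is the part a library such as SCR is meant to help with.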