Re: [OMPI users] Fault Tolerance

George Bosilca Wed, 21 Mar 2007 14:40:01 -0400

What you're looking for is called PVM. Moreover, your requirements are a mixed bag of FT features that come from completely different worlds.

1) Recover from any software/hardware crash? What kind of recovery are you looking for? What is your definition of recovering? If what you want is to be able to continue to send or receive messages once the fault has been detected, then FT-MPI is the only MPI implementation which allows you to consistently continue your execution. To be more precise, the MPI standard does not define the behavior of the MPI library once you get back from the error handler that is called when a fault has been detected. As far as I know, the behavior depends on the MPI library, and with the exception of FT-MPI no other library has a consistent state after returning from the error handler.

2) Dynamically shrink and grow? Based on what? This looks like MPI-2 dynamic processes, except you still have the original MPI_COMM_WORLD, which cannot be shrunk. If what you want is to be able to shrink your MPI_COMM_WORLD when a fault occurs, then again the only solution is FT-MPI.

3) Migrate processes among machines? What processes? When and how? LAM allows you to checkpoint/restart the entire job, and it should be done before the fault occurs. MPICH-V allows transparent non-coordinated checkpointing (i.e. you don't get any notification that a fault was detected), but you will pay the cost of message logging. FT-MPI modifies the runtime environment when a fault occurs, but does not do migration (if migration means moving the application image with all the data to another machine).

Unfortunately, there is no miracle MPI which is able to do all the stuff you're looking for. You need multi-threading and fault tolerance? I would use FT-MPI with a lock around all MPI functions, something close to the serialized thread mode as defined by the MPI standard.
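
The lock-around-every-call idea can be sketched roughly as follows. This is only a minimal sketch and not FT-MPI code: fake_mpi_call is a hypothetical stand-in for a real MPI function, and mpi_lock, locked_mpi_call and run_workers are names made up for the illustration. It assumes POSIX threads and uses one process-wide mutex so that at most one thread is inside the "library" at any time.

```c
#include <assert.h>
#include <pthread.h>

#define CALLS_PER_THREAD 10000
#define MAX_THREADS 16

/* One process-wide lock guarding every call into the (stand-in) MPI library. */
static pthread_mutex_t mpi_lock = PTHREAD_MUTEX_INITIALIZER;

/* Counter updated without any synchronization of its own, to mimic a
 * library that is not thread-safe. */
static long calls_made = 0;

/* Hypothetical stand-in for an MPI function; NOT a real MPI call. */
static int fake_mpi_call(int value)
{
    long seen = calls_made;   /* unsynchronized read-modify-write */
    calls_made = seen + 1;
    return value;
}

/* Wrapper: take the lock, make the call, release the lock. */
int locked_mpi_call(int value)
{
    pthread_mutex_lock(&mpi_lock);
    int rc = fake_mpi_call(value);
    pthread_mutex_unlock(&mpi_lock);
    return rc;
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < CALLS_PER_THREAD; i++)
        locked_mpi_call(i);
    return NULL;
}

/* Start nthreads workers, wait for them, and return the number of
 * calls the stand-in library recorded. */
long run_workers(int nthreads)
{
    pthread_t tids[MAX_THREADS];
    int started = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    calls_made = 0;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, worker, NULL) == 0)
            started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return calls_made;
}
```

If all threads start, run_workers(4) returns 4 * CALLS_PER_THREAD, because the mutex prevents lost updates to the counter. In a real program every MPI function would be wrapped the same way. On MPI libraries that support it, the corresponding thread level is MPI_THREAD_SERIALIZED, requested through MPI_Init_thread.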


  george.

On Mar 21, 2007, at 1:09 PM, Mohammad Huwaidi wrote:

Hello folks,
I am trying to write some fault-tolerant systems with the following criteria:
1) Recover from any software/hardware crashes
2) Dynamically Shrink and grow.
3) Migrate processes among machines.
Does anyone have examples of code? What MPI platform is recommended to accomplish such requirements?
I am using three MPI platforms and each has its own issues:
1) MPICH2 - good multi-threading support, but bad fault-tolerance mechanisms.
2) OpenMPI - Does not support multi-threading properly, and I cannot have it trap exceptions yet.
3) FT-MPI - Old and does not support multi-threading at all.

Any suggestions?
--

Regards,
Mohammad Huwaidi

We can't resolve problems by using the same kind of thinking we used
when we created them.
                                                --Albert Einstein
