What you're looking for is called PVM. Moreover, your requirements are a mixed bag of FT features that come from completely different worlds.

1) Recover from any software/hardware crash? What kind of recovery are you looking for? What is your definition of recovering? If what you want is to be able to continue sending and receiving messages once a fault has been detected, then FT-MPI is the only MPI implementation that allows you to consistently continue your execution. To be more precise, the MPI standard does not define the behavior of the MPI library once you return from the error handler that gets called when a fault has been detected. As far as I know, the behavior depends on the MPI library, and with the exception of FT-MPI no library is left in a consistent state after returning from the error handler.

2) Dynamically shrink and grow? Based on what? This looks like MPI-2 dynamic processes, except that you still have the original MPI_COMM_WORLD, which cannot be shrunk. If what you want is to be able to shrink your MPI_COMM_WORLD when a fault occurs, then again the only solution is FT-MPI.

3) Migrate processes among machines? Which processes? When and how? LAM allows you to checkpoint/restart the entire job, and it should be done before the fault occurs. MPICH-V allows transparent uncoordinated checkpointing (i.e. you don't get any notification that a fault was detected), but you will pay the cost of message logging. FT-MPI modifies the runtime environment when a fault occurs, but does not do migration (if migration means moving the application image with all its data to another machine).

Unfortunately, there is no miracle MPI that is able to do everything you're looking for. You need multi-threading and fault tolerance? I would use FT-MPI with a lock around all MPI functions, something close to the serialized thread mode (MPI_THREAD_SERIALIZED) as defined by the MPI standard.

  george.

On Mar 21, 2007, at 1:09 PM, Mohammad Huwaidi wrote:

Hello folks,

I am trying to write some fault-tolerance systems with the following criteria:
1) Recover any software/hardware crashes
2) Dynamically Shrink and grow.
3) Migrate processes among machines.

Does anyone have examples of code? What MPI platform is recommended to accomplish such requirements?

I am using three MPI platforms and each has its own issues:
1) MPICH2 - good multi-threading support, but bad fault-tolerance mechanisms.
2) OpenMPI - Does not support multi-threading properly and cannot have it trap exceptions yet.
3) FT-MPI - Old and does not support multi-threading at all.

Any suggestions?
--

Regards,
Mohammad Huwaidi

We can't resolve problems by using the same kind of thinking we used
when we created them.
                                                --Albert Einstein
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
