On Jul 23, 2009, at 7:36 AM, vipin kumar wrote:
I can't use blocking communication routines in my main program ("masterprocess") because any type of network failure (due to physical connectivity, TCP connectivity, or the MPI connection, as you said) may occur. So I am using non-blocking point-to-point communication routines and TEST later for completion of that Request. Once I enter a TEST loop, I will test for Request completion until TIMEOUT. Suppose TIMEOUT has occurred. In this case I will first check whether
Open MPI should return a failure if TCP connectivity is lost, even with a non-blocking point-to-point operation. The failure should be returned in the call to MPI_TEST (and friends). So I'm not sure your timeout has meaning here -- if you reach the timeout, I think it simply means that the MPI communication has not completed yet. It does not necessarily mean that the MPI communication has failed.
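To make the distinction between "timed out" and "failed" concrete, here is a minimal sketch in Python. The helper name `wait_with_timeout` and its `test` callable are hypothetical and not part of any MPI API. In a real program, `test` would wrap the MPI test call (for example, a request's `Test()` method in mpi4py), and the sketch assumes that errors are reported from that call instead of aborting the job.

```python
import time


def wait_with_timeout(test, timeout_s, poll_interval_s=0.01):
    """Poll test() until it reports completion or timeout_s elapses.

    Returns "completed" if test() returned True, or "pending" if the
    deadline passed first. "pending" is NOT a failure: it only means
    the operation has not finished yet. A real failure is expected to
    surface as an exception raised by test() itself, which is allowed
    to propagate to the caller.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if test():  # hypothetical wrapper around the MPI test call
            return "completed"
        if time.monotonic() >= deadline:
            return "pending"
        time.sleep(poll_interval_s)


if __name__ == "__main__":
    # Stand-in for a request that never completes within the timeout.
    print(wait_with_timeout(lambda: False, timeout_s=0.05))
```

In this sketch the caller can treat the two outcomes differently: a "pending" result may justify further checks or more waiting, while an exception from the test call is the actual failure report.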
1: The slave machine is reachable or not. (How would I do that? I have the IP address and host name of the slave machine.)
2: If it is reachable, whether the programs (orted and "slaveprocess") are alive or not.
MPI doesn't provide any standard way to check reachability and/or health of a peer process.
That being said, I think some of the academics are working on more fault tolerant / resilient MPI messaging, but I don't know if they're ready to talk about such efforts publicly yet.
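Since MPI itself offers no standard check, reachability of the slave machine would have to be probed outside MPI. The following is a rough sketch in Python, assuming that some TCP service on the slave (for example an SSH daemon; the port is an assumption, not something stated in this thread) accepts connections. The function name `is_reachable` is made up for illustration.

```python
import socket


def is_reachable(host, port, timeout_s=2.0):
    """Return True if a TCP connection to (host, port) can be opened.

    This only shows that the machine answers on that port. It says
    nothing about whether orted or the slave process is still alive.
    """
    try:
        # create_connection resolves the host name and applies the timeout
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors
        return False


if __name__ == "__main__":
    # Hypothetical host name and port, for illustration only.
    print(is_reachable("slave-host.example", 22))
```

A successful connection is only a hint that the host is up. Checking that a particular process is alive would need an application-level mechanism, which is beyond what this sketch covers.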
-- Jeff Squyres jsquy...@cisco.com