Thank you all (Jeff, Jody, Prentice and Bogdan) for your invaluable clarifications, solutions and suggestions.

> Open MPI should return a failure if TCP connectivity is lost, even with a
> non-blocking point-to-point operation. The failure should be returned in
> the call to MPI_TEST (and friends).

Even if MPI_TEST is a local operation?

> So I'm not sure your timeout has meaning here -- if you reach the timeout,
> I think it simply means that the MPI communication has not completed yet.
> It does not necessarily mean that the MPI communication has failed.

You are absolutely correct, but the job should be done before the timeout expires. That is the reason I am using a timeout.

So the conclusion is:

> MPI doesn't provide any standard way to check reachability and/or health
> of a peer process.

That's what I wanted to confirm, and to find out the solution, if any, or an alternative. So now I think I should go for Jody's approach.

> How about you start your MPI program from a shell script that does the
> following:
>
> 1. Reads a text file containing the names of all the possible candidates
> for MPI nodes
>
> 2. Loops through the list of names from (1) and pings each machine to
> see if it's alive. If the host is pingable, then write its name to a
> different text file which will be used as the machine file for the
> mpirun command
>
> 3. Call mpirun using the machine file generated in (2).

I am assuming the processes have already been launched successfully.

--
Vipin K.
Research Engineer,
C-DOTB, India
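
P.S. For the archives, the quoted shell-script suggestion could be sketched roughly as below. This is only a minimal sketch under assumptions: the file names, the program name and the process count are placeholders I made up, and the ping flags (-c for count, -W for timeout in seconds) are the ones from Linux iputils and may differ on other systems. Note that it only checks reachability before launch; it does nothing about a peer that becomes unreachable after mpirun has started.

```shell
#!/bin/sh
# Sketch of the "ping first, then launch" idea from the quoted suggestion.
# All file and program names here are hypothetical placeholders.

# is_alive HOST: succeeds if HOST answers a single ping within 2 seconds.
# Assumes Linux iputils ping flags (-c count, -W timeout in seconds).
is_alive() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1 </dev/null
}

# build_machinefile CANDIDATES OUTPUT:
# read host names (one per line) from CANDIDATES and write the
# reachable ones to OUTPUT, which can then serve as the machine file.
build_machinefile() {
    : > "$2"                              # start with an empty output file
    while IFS= read -r host || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        if is_alive "$host"; then
            printf '%s\n' "$host" >> "$2"
        fi
    done < "$1"
}

# Example use (placeholder names):
#   build_machinefile candidates.txt machines.txt
#   mpirun --hostfile machines.txt -np 4 ./my_mpi_program
```

The reachability check sits in its own function so that it can be swapped for something else (an ssh probe, for instance) without touching the loop.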