Thank you, Jeff, Jody, Prentice and Bogdan, for your invaluable
clarifications, solutions and suggestions.

> Open MPI should return a failure if TCP connectivity is lost, even with a
> non-blocking point-to-point operation.  The failure should be returned in
> the call to MPI_TEST (and friends).


Even if MPI_TEST is a local operation?


>  So I'm not sure your timeout has meaning here -- if you reach the timeout,
> I think it simply means that the MPI communication has not completed yet.
>  It does not necessarily mean that the MPI communication has failed.
>

You are absolutely correct, but the job should be done before the timeout
expires. That is the reason I am using a timeout.
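
To make the distinction above concrete, here is a minimal sketch of polling a
non-blocking operation against a deadline. It is an illustration only, not
code from my application: `poll_until` and its parameters are hypothetical
names, and `test` stands in for any completion check (for example, the `Test`
method of an mpi4py request, assuming mpi4py is used). A `False` result means
only that the operation had not completed when the deadline passed.

```python
import time


def poll_until(test, timeout_s, interval_s=0.01,
               clock=time.monotonic, sleep=time.sleep):
    """Call test() until it returns True or timeout_s seconds elapse.

    Returns True if the operation completed in time, False otherwise.
    False means "not completed yet"; it does not prove the peer failed.
    """
    deadline = clock() + timeout_s
    while True:
        if test():
            # The operation reported completion.
            return True
        if clock() >= deadline:
            # Deadline reached without completion.
            return False
        sleep(interval_s)


# Hypothetical usage with an mpi4py request object `req`:
#   completed = poll_until(req.Test, timeout_s=5.0)
```

What to do when the result is `False` (retry, cancel, abort the job) remains
an application-level decision.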

So the conclusion is:

>
>  MPI doesn't provide any standard way to check reachability and/or health
> of a peer process.


That's what I wanted to confirm, and to find a solution or an alternative,
if any.

So now I think I should go for Jody's approach:


>
> How about you start your MPI program from a shell script that does the
> following:
>
> 1. Reads a text file containing the names of all the possible candidates
>  for MPI nodes
>
> 2. Loops through the list of names from (1) and pings each machine to
> see if it's alive. If the host is pingable, then write its name to a
> different text file which will be used as the machine file for the
> mpirun command
>
> 3. Call mpirun using the machine file generated in (2).
>
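
The three steps above could be sketched roughly as follows. I have written it
in Python rather than as a shell script, and everything here is an
assumption for illustration: the function names, the file names, and the
Linux-style `ping -c 1 -W <seconds>` flags are mine, and `--hostfile` is the
Open MPI option for passing a machine file to mpirun.

```python
import subprocess


def is_alive(host, timeout_s=2):
    """Return True if a single ping to host succeeds (Linux ping flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def read_candidates(path):
    """Step 1: read candidate host names, skipping blank lines and # comments."""
    hosts = []
    with open(path) as f:
        for line in f:
            name = line.strip()
            if name and not name.startswith("#"):
                hosts.append(name)
    return hosts


def write_machinefile(candidates, out_path, probe=is_alive):
    """Step 2: write the hosts that answer the probe to out_path, one per line."""
    alive = [host for host in candidates if probe(host)]
    with open(out_path, "w") as f:
        for host in alive:
            f.write(host + "\n")
    return alive


def launch(machinefile, nprocs, program):
    """Step 3: run mpirun with the generated machine file; returns its exit code."""
    cmd = ["mpirun", "--hostfile", machinefile, "-np", str(nprocs), program]
    return subprocess.call(cmd)
```

A typical call sequence would be `hosts = read_candidates("candidates.txt")`,
then `alive = write_machinefile(hosts, "machinefile.txt")`, then
`launch("machinefile.txt", len(alive), "./my_mpi_program")`, where all three
file and program names are placeholders. Note that this only checks
reachability before launch; it says nothing about a node that goes away
while the job is running.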

I am assuming processes have been launched successfully.



-- 
Vipin K.
Research Engineer,
C-DOTB, India
