1. It is extremely unlikely to be a broken MPI communication pipe. Use a
parallel debugger to validate that your communication pattern is correct. I
would suspect a deadlock due to an incomplete communication pattern (see
the sketch below) rather than a broken communication pipe.
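
For illustration, here is a minimal, hypothetical example of such an
incomplete pattern (the ring topology and all names are made up): every
rank posts a receive from its left neighbor, but rank 0 never posts the
matching send, so one rank blocks in MPI_Waitall forever while the job
still looks alive.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, buf = 0, msg;
    MPI_Request reqs[2];
    int nreqs = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    /* Every rank expects a message from its left neighbor... */
    MPI_Irecv(&msg, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &reqs[nreqs++]);

    /* ...but rank 0 "forgets" the matching send, breaking the ring. */
    if (rank != 0)
        MPI_Isend(&buf, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[nreqs++]);

    /* Rank 1 blocks here forever: its receive is never matched. */
    MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d done\n", rank);
    MPI_Finalize();
    return 0;
}

A parallel debugger attached to the hung job would show the stuck rank
sitting in MPI_Waitall with an unmatched receive, which is how you tell
this apart from a transport failure.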

2. Nope, you can't set timeouts on MPI calls. There was an effort in the
past to push a timeout interface, but it failed. More information about
MPI/RT is available at http://www.cse.msstate.edu/~yogi/dandass-mpirt-2004.pdf
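
That said, a common application-level workaround (not part of the MPI
standard) is to replace a blocking MPI_Waitall with an MPI_Testall polling
loop guarded by a wall-clock deadline, and to call MPI_Abort when it
expires. A minimal sketch; the helper name and the TIMEOUT_SECONDS value
are assumptions for illustration:

#include <mpi.h>
#include <unistd.h>  /* usleep */

#define TIMEOUT_SECONDS 300.0  /* assumed application-specific limit */

static void waitall_with_timeout(int nreqs, MPI_Request *reqs)
{
    double start = MPI_Wtime();
    int done = 0;

    while (!done) {
        MPI_Testall(nreqs, reqs, &done, MPI_STATUSES_IGNORE);
        if (!done && MPI_Wtime() - start > TIMEOUT_SECONDS) {
            /* Tear down the whole job so the scheduler sees a failure
             * instead of a silently hung "RUN" state. */
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        if (!done)
            usleep(1000);  /* back off instead of spinning at full speed */
    }
}

This only catches hangs in the waits you instrument, and the polling loop
trades a little CPU for the ability to fail fast, but it maps directly onto
the MPI_Waitall scenario described below.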

  George.


On Fri, Sep 19, 2014 at 7:11 PM, Gan, Qi PW <qi.g...@pw.utc.com> wrote:

>  Hi all,
>
>
>
> I have a question about setting a timeout limit for MPI data transmissions.
> Our users run their parallel jobs (with Open MPI) on our HPC cluster.
> Sometimes a job hangs for unknown reasons. In such cases the job is still
> in “RUN” status and all of its processes are running, but no output is
> produced (normally a job writes to its output files periodically). We
> suspect this may be caused by a broken MPI communication pipe, which
> stalls the entire job.
>
>
>
> For example, all processes exchange data at the end of a computation and
> synchronize using the MPI_Waitall function. If one of the communication
> links (e.g. Ethernet, InfiniBand) fails and the system is not able to
> detect it, then all processes stay in MPI_Waitall indefinitely. The whole
> job still looks “running” but it doesn’t do anything useful.
>
>
>
> My question is: is it possible to set a “timeout” flag in MPI functions so
> that if the time spent in a function (e.g. MPI_Waitall) exceeds the preset
> timeout limit, the function is aborted and the whole job is terminated?
>
>
>
> In our environment, we use Open MPI v1.4.5 and v1.6.5 on a Linux platform.
> The job scheduling tool is LSF v8.4.
>
>
>
> Thanks for the help,
>
>
>
> Qi
>
>
>
