1. It is extremely unlikely that you have a broken MPI communication pipe. Use a parallel debugger to validate that your communication pattern is correct. I would suspect a deadlock caused by an incomplete communication pattern far more than a broken communication pipe.
2. No, you can't set timeouts on MPI calls. There was an effort in the past to push a timeout interface, but it failed. More information about that effort (MPIRT) is available at http://www.cse.msstate.edu/~yogi/dandass-mpirt-2004.pdf

George.

On Fri, Sep 19, 2014 at 7:11 PM, Gan, Qi PW <qi.g...@pw.utc.com> wrote:

> Hi all,
>
> I have a question about set timeout limit for MPI data transmissions. Our
> users run their parallel jobs (with openmpi) on our HPC cluster. Sometimes
> the job may hang due to unknown reason. In such case a job is still in
> “RUN” status, all processes of this job are running. But not output is
> produced (in normal a job writes to the output files periodically). We
> suspect that is may be caused by the broken MPI communication pipe, which
> stalls the entire job.
>
> For example, all processes exchange data at the end of computations, and
> synchronize by using MPI_waitall function. If one of the communication
> links (e.g. Ethernet, Infiniband) fails and the system is not able to
> detect it, then all processes are staying with MPI_waitall indefinitely.
> The whole job still looks “running” but it doesn’t do anything useful.
>
> My question is: is it possible to set up “timeout” flag in MPI functions
> so that if the time spent by a function (e.g. MPI_waitall) exceeds the
> preset timeout limit then the function is aborted and the whole job is
> terminated?
>
> In our environment, we use OpenMPI v1.4.5 and v1.6.5 on Linux platform.
> The job scheduling tool is LSF v8.4.
>
> Thanks for the help,
>
> Qi
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/09/25363.php
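
A footnote to point 2: since MPI calls take no timeout argument, one possible application-level workaround is to replace the blocking MPI_Waitall with a polling loop around the non-blocking MPI_Testall and give up once a deadline passes. The sketch below shows only the deadline logic, in plain C without MPI so that it compiles anywhere. The names poll_until, check_fn and ready_after_n_polls are hypothetical and are not part of MPI or Open MPI; in a real job the callback would wrap MPI_Testall and return its completion flag.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime and nanosleep */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns nonzero once the awaited operation
 * has completed. In an MPI program it could wrap MPI_Testall and return
 * the flag that call sets. */
typedef int (*check_fn)(void *ctx);

/* Current time in seconds from a clock that never jumps backwards. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Poll check(ctx) until it reports completion or timeout_s seconds elapse,
 * sleeping sleep_ns nanoseconds between polls.
 * Returns 0 on completion, -1 on timeout. */
int poll_until(check_fn check, void *ctx, double timeout_s, long sleep_ns) {
    assert(check != NULL);
    const double deadline = now_seconds() + timeout_s;
    const struct timespec pause = {
        .tv_sec  = sleep_ns / 1000000000L,
        .tv_nsec = sleep_ns % 1000000000L
    };
    for (;;) {
        if (check(ctx)) return 0;                  /* operation finished */
        if (now_seconds() >= deadline) return -1;  /* gave up waiting */
        nanosleep(&pause, NULL);                   /* avoid a pure busy-wait */
    }
}

/* Stand-in for a real completion test: reports "done" after *ctx polls. */
int ready_after_n_polls(void *ctx) {
    int *remaining = ctx;
    if (*remaining > 0) --*remaining;
    return *remaining == 0;
}
```

In an MPI code, a -1 return would be the place to call MPI_Abort(MPI_COMM_WORLD, 1) so that the scheduler sees the job die instead of hang. Two caveats: MPI_Abort is only a best-effort termination, and this does nothing to explain why the job stalled, so it does not replace checking the communication pattern as suggested in point 1.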