Hello Ed,

Could you post the output of ompi_info? It would also help to know which variant of the collective ops you're using. If you could post the output when you run with
mpirun --mca coll_base_verbose 10 "other mpirun args you've been using"

that would be great. Also, if you know the sizes (number of elements) involved in the reduce and allreduce operations, it would be helpful to know this as well.

Thanks,

Howard


2014-09-25 3:34 GMT-06:00 Blosch, Edwin L <edwin.l.blo...@lmco.com>:

> I had an application suddenly stop making progress. By killing the last
> process out of 208 processes, then looking at the stack trace, I found 3 of
> the 208 processes were in an MPI_REDUCE call. The other 205 had progressed in
> their execution to another routine, where they were waiting in an unrelated
> MPI_ALLREDUCE call.
>
> The code structure is such that each process calls MPI_REDUCE 5 times
> for different variables, then some work is done, then the MPI_ALLREDUCE
> call happens early in the next iteration of the solution procedure. I
> thought it was also noteworthy that the 3 processes stuck at MPI_REDUCE
> were actually stuck on the 4th of the 5 MPI_REDUCE calls, not the 5th.
>
> No issues with MVAPICH. The problem was easily solved by adding MPI_BARRIER
> after the section of MPI_REDUCE calls.
>
> It seems like MPI_REDUCE has some kind of non-blocking implementation, and
> it was not safe to enter the MPI_ALLREDUCE while those MPI_REDUCE calls had
> not yet completed on other processes.
>
> This was in OpenMPI 1.8.1. The same problem was seen on 3 slightly different
> systems, all QDR InfiniBand with Mellanox HCAs, using a Mellanox OFED stack
> (slightly different versions on each cluster). Intel compilers, again with
> slightly different versions on each of the 3 systems.
>
> Has anyone encountered anything similar? While I have a workaround, I
> want to make sure the root cause of the deadlock gets fixed. Please let me
> know what I can do to help.
>
> Thanks,
>
> Ed
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/09/25389.php
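
For readers following the thread, here is a minimal sketch of the call pattern Ed describes. The function name, variable names, element counts, reduction operations, and use of MPI_COMM_WORLD are assumptions for illustration only; the structure (five MPI_Reduce calls, the MPI_Barrier workaround, then an unrelated MPI_Allreduce early in the next iteration) is taken from the report above.

  /* Sketch of the reported pattern; names, counts, and ops are
   * placeholders, not taken from the real application. */
  #include <mpi.h>

  void iteration_step(double vals[5], double *global_min)
  {
      double reduced[5];
      double local_min = vals[0];

      /* Five consecutive MPI_Reduce calls onto rank 0, one per variable. */
      for (int i = 0; i < 5; i++) {
          MPI_Reduce(&vals[i], &reduced[i], 1, MPI_DOUBLE,
                     MPI_SUM, 0, MPI_COMM_WORLD);
      }

      /* Workaround described above: a barrier keeps faster ranks from
       * reaching the next iteration's MPI_Allreduce while the reduces
       * are still outstanding on other ranks. */
      MPI_Barrier(MPI_COMM_WORLD);

      /* ... local work between iterations ... */

      /* Early in the next iteration: an unrelated MPI_Allreduce. */
      MPI_Allreduce(&local_min, global_min, 1, MPI_DOUBLE,
                    MPI_MIN, MPI_COMM_WORLD);
  }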