Running with rdmacm, the problem does seem to resolve itself.
The code is large and complicated, but the problem does appear to arise
regularly when run.
Just FYI, can I collect extra information to help find a fix?
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.ed
This could be related to https://svn.open-mpi.org/trac/ompi/ticket/2714 and/or
https://svn.open-mpi.org/trac/ompi/ticket/2722.
There isn't much info in the ticket, but we've been talking about it a bunch
offline. IBM and Mellanox have had reports of the error, but haven't been able
to reproduc
I have a user whose code performs fine when run on Ethernet. When run on
verbs-based IB, the code deadlocks in an MPI_Allreduce() call.
We are using openmpi/1.4.3 with the Intel compilers.
I poked at the running code with padb and I get the following:
051525354...