I have a user whose code performs fine when run over Ethernet. When run over verbs-based IB, the code deadlocks in an MPI_Allreduce() call.
We are using openmpi/1.4.3 with the Intel compilers. I poked at the running code with padb and I get the following:

0....5....1....5....2....5....3....5....4....5....
,,---,-,-,----,--,--,,-,RRRRRRRR,---,----,,--,-,-,
,,-,-,,,-,,--,-,,-,-,-,-RRRRRRRR-,-,---,,,--,,---,
,,---,-,,,,-,-,,-,-,----RRRRRRRR,----,-,--,,-----,
--,,-,-,,,,-,,------,,--RRRRRRRR,,----,,--,------,

Across multiple runs, the set of ranks stuck in MPI_Allreduce() changes. Are there any open bugs that match this? I found one, but it only affects shared memory, and as far as I can tell our version is new enough to avoid it.

What should I look for to diagnose the issue?

Thanks,

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985