This could be related to https://svn.open-mpi.org/trac/ompi/ticket/2714 and/or https://svn.open-mpi.org/trac/ompi/ticket/2722.
There isn't much info in the ticket, but we've been talking about it a bunch offline. IBM and Mellanox have had reports of the error, but haven't been able to reproduce it reliably. It *seems* to be a race condition in the "oob" connection model of the openib BTL.

If you run with --mca btl_openib_cpc_include rdmacm, does the problem go away?

On Mar 16, 2011, at 11:27 AM, Brock Palen wrote:

> I have a user whose code performs fine when run on Ethernet. When run on
> verbs-based IB, the code deadlocks in an MPI_Allreduce() call.
>
> We are using openmpi/1.4.3 with the Intel compilers.
>
> I poked at the running code with padb and I get the following:
>
> 0....5....1....5....2....5....3....5....4....5....
> ,,---,-,-,----,--,--,,-,RRRRRRRR,---,----,,--,-,-,
> ,,-,-,,,-,,--,-,,-,-,-,-RRRRRRRR-,-,---,,,--,,---,
> ,,---,-,,,,-,-,,-,-,----RRRRRRRR,----,-,--,,-----,
> --,,-,-,,,,-,,------,,--RRRRRRRR,,----,,--,------,
>
> Across multiple runs, which ranks are stuck in MPI_Allreduce() changes.
> Are there any open bugs? I found one, but only on shared memory, and our
> version should be new enough (from what I could tell) to avoid it.
>
> Thanks, what should I look for to diagnose the issue?
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
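
P.S. To make the comparison above easier to repeat, here is a minimal sketch of a helper that builds the two mpirun command lines (default connection setup versus rdmacm). Only the --mca btl_openib_cpc_include rdmacm option comes from the suggestion above; the helper name build_mpirun_cmd, the executable ./a.out, and the process count 16 are placeholders I made up for illustration, so substitute the user's real job.

```python
import shlex


def build_mpirun_cmd(executable, nprocs, cpc=None):
    """Return an mpirun argument list.

    executable and nprocs are placeholders for the real job.
    If cpc is given, it is passed as the btl_openib_cpc_include MCA
    parameter (e.g. "rdmacm", as suggested in the reply above).
    """
    cmd = ["mpirun", "-np", str(nprocs)]
    if cpc is not None:
        cmd += ["--mca", "btl_openib_cpc_include", cpc]
    cmd.append(executable)
    return cmd


if __name__ == "__main__":
    # Hypothetical job: ./a.out on 16 ranks. Print both variants so the
    # two runs can be launched and compared by hand.
    for cpc in (None, "rdmacm"):
        print(shlex.join(build_mpirun_cmd("./a.out", 16, cpc)))
```

If the hang only shows up with the first command and not the second, that would be consistent with the suspected race in the "oob" connection setup, though it would not prove it.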