Running with rdmacm, the problem does seem to resolve itself. The code is large and complicated, but the problem does appear to arise regularly when run.
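For anyone following along, a minimal sketch of the kind of invocation meant here. The `--mca btl_openib_cpc_include rdmacm` setting is the one suggested in the quoted message; the rank count and the executable name are placeholders, not the actual job:

```shell
# Select the rdmacm connection manager for the openib BTL instead of the
# default "oob" one. NP and APP are hypothetical placeholders.
NP=16
APP=./my_mpi_app
mpirun --mca btl_openib_cpc_include rdmacm -np "$NP" "$APP"
```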
Just FYI, can I collect extra information to help find a fix?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

On Mar 16, 2011, at 8:38 PM, Jeff Squyres wrote:

> This could be related to https://svn.open-mpi.org/trac/ompi/ticket/2714
> and/or https://svn.open-mpi.org/trac/ompi/ticket/2722.
>
> There isn't much info in the ticket, but we've been talking about it a bunch
> offline. IBM and Mellanox have had reports of the error, but haven't been
> able to reproduce it reliably. It *seems* to be a race condition in the
> "oob" connection model of the openib BTL.
>
> If you run with --mca btl_openib_cpc_include rdmacm, does the problem go away?
>
>
> On Mar 16, 2011, at 11:27 AM, Brock Palen wrote:
>
>> I have a user whose code performs fine when run on Ethernet. When run on
>> verbs-based IB, the code deadlocks in an MPI_Allreduce() call.
>>
>> We are using openmpi/1.4.3 with the Intel compilers.
>>
>> I poked at the running code with padb and I get the following:
>>
>> 0....5....1....5....2....5....3....5....4....5....
>> ,,---,-,-,----,--,--,,-,RRRRRRRR,---,----,,--,-,-,
>> ,,-,-,,,-,,--,-,,-,-,-,-RRRRRRRR-,-,---,,,--,,---,
>> ,,---,-,,,,-,-,,-,-,----RRRRRRRR,----,-,--,,-----,
>> --,,-,-,,,,-,,------,,--RRRRRRRR,,----,,--,------,
>>
>>
>> Across multiple runs, which ranks are stuck in MPI_Allreduce() changes.
>> Are there any open bugs? I found one, but only on shared memory, and our
>> version should be new enough (from what I could tell) to avoid it.
>>
>> Thanks, what should I look for to diagnose the issue?
>>
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/