Running with rdmacm, the problem does seem to resolve itself.
The code is large and complicated, but the problem does appear to arise
regularly when it is run.
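
For anyone following along, here is a minimal sketch of one way to select rdmacm, assuming a bash-like shell. The parameter name and value come from Jeff's suggestion quoted below; the environment-variable form relies on Open MPI's general OMPI_MCA_<name> convention, and the rank count and binary name in the comment are placeholders, not the real job.

```shell
# Select the rdmacm connection manager for the openib BTL via the environment.
# OMPI_MCA_<name> is the environment form of "--mca <name> <value>".
export OMPI_MCA_btl_openib_cpc_include=rdmacm

# Equivalent one-off form on the command line (rank count and binary are
# placeholders, not taken from this thread):
#   mpirun --mca btl_openib_cpc_include rdmacm -np 64 ./user_app
```

Setting it in the environment should apply to every mpirun launched from that shell, which can be handy when a batch script starts several runs.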

That was just an FYI. Is there extra information I can collect to help find a fix?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Mar 16, 2011, at 8:38 PM, Jeff Squyres wrote:

> This could be related to https://svn.open-mpi.org/trac/ompi/ticket/2714 
> and/or https://svn.open-mpi.org/trac/ompi/ticket/2722.
> 
> There isn't much info in the ticket, but we've been talking about it a bunch 
> offline.  IBM and Mellanox have had reports of the error, but haven't been 
> able to reproduce it reliably.  It *seems* to be a race condition in the 
> "oob" connection model of the openib BTL.
> 
> If you run with --mca btl_openib_cpc_include rdmacm, does the problem go away?
> 
> 
> On Mar 16, 2011, at 11:27 AM, Brock Palen wrote:
> 
>> I have a user whose code performs fine when run on Ethernet. When run on
>> verbs-based IB, the code deadlocks in an MPI_Allreduce() call.
>> 
>> We are using openmpi/1.4.3 with the Intel compilers.
>> 
>> I poked at the running code with padb and I get the following:
>> 
>> 0....5....1....5....2....5....3....5....4....5....
>> ,,---,-,-,----,--,--,,-,RRRRRRRR,---,----,,--,-,-,
>> ,,-,-,,,-,,--,-,,-,-,-,-RRRRRRRR-,-,---,,,--,,---,
>> ,,---,-,,,,-,-,,-,-,----RRRRRRRR,----,-,--,,-----,
>> --,,-,-,,,,-,,------,,--RRRRRRRR,,----,,--,------,
>> 
>> 
>> Across multiple runs, which ranks are stuck in MPI_Allreduce() changes.
>> Are there any open bugs?  I found one, but only on shared memory, and our
>> version should be new enough (from what I could tell) to avoid it.
>> 
>> Thanks.  What should I look for to diagnose the issue?
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

