This could be related to https://svn.open-mpi.org/trac/ompi/ticket/2714 and/or 
https://svn.open-mpi.org/trac/ompi/ticket/2722.

There isn't much info in the tickets, but we've been discussing it quite a bit 
offline.  IBM and Mellanox have both had reports of the error, but neither has 
been able to reproduce it reliably.  It *seems* to be a race condition in the 
"oob" connection model of the openib BTL.

If you run with --mca btl_openib_cpc_include rdmacm, does the problem go away?
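
For example, something like the following; only the --mca option is the part 
that matters here.  The process count and the executable name (./your_app) are 
placeholders, so substitute whatever your job normally uses:

```shell
# Select the rdmacm connection method instead of the default "oob" one
# for the openib BTL. "-np 4" and "./your_app" are placeholders.
mpirun --mca btl_openib_cpc_include rdmacm -np 4 ./your_app
```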


On Mar 16, 2011, at 11:27 AM, Brock Palen wrote:

> I have a user whose code performs fine when run on Ethernet. When run on 
> verbs-based IB, the code deadlocks in an MPI_Allreduce() call.
> 
> We are using openmpi/1.4.3 with the Intel compilers.
> 
> I poked at the running code with padb and I get the following:
> 
> 0....5....1....5....2....5....3....5....4....5....
> ,,---,-,-,----,--,--,,-,RRRRRRRR,---,----,,--,-,-,
> ,,-,-,,,-,,--,-,,-,-,-,-RRRRRRRR-,-,---,,,--,,---,
> ,,---,-,,,,-,-,,-,-,----RRRRRRRR,----,-,--,,-----,
> --,,-,-,,,,-,,------,,--RRRRRRRR,,----,,--,------,
> 
> 
> Across multiple runs, which ranks are stuck in MPI_Allreduce() changes. 
> Are there any open bugs?  I found one, but only on shared memory, and our 
> version should be new enough (from what I could tell) to avoid it.
> 
> Thanks.  What should I look for to diagnose the issue?
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

