On 5/4/2012 8:26 AM, Rolf vandeVaart wrote:
2. If that works, then you can also run with a debug switch to see
what connections are being made by MPI.
You can see the connections being made in the attached log:
[archimedes:29820] btl: tcp: attempting to connect() to [[60576,1],2] address 138.23.141.162 on port 2001
Yes, I missed that. So, can we simplify the problem? Can you run with np=2
and one process on each node?
Also, maybe you can send the ifconfig output from each node. We sometimes see
this type of hang when a node has two different interfaces on the same subnet.
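One way to check a node for that condition is sketched below. It assumes the addresses and prefix lengths have been copied by hand from the ifconfig output into a dictionary; the interface names and addresses shown are hypothetical, not taken from the nodes in this thread.

```python
import ipaddress
from collections import defaultdict


def shared_subnets(interfaces):
    """Return {network: [interface names]} for networks used by 2+ interfaces.

    `interfaces` maps an interface name to an "address/prefix" string.
    """
    by_network = defaultdict(list)
    for name, cidr in interfaces.items():
        # ip_interface keeps the host bits; .network gives the enclosing subnet.
        network = ipaddress.ip_interface(cidr).network
        by_network[str(network)].append(name)
    return {net: sorted(names) for net, names in by_network.items() if len(names) > 1}


# Hypothetical values, hand-copied from ifconfig on one node.
example = {
    "eth0": "192.168.1.10/24",
    "eth1": "192.168.1.11/24",
    "eth2": "10.0.0.5/8",
}
print(shared_subnets(example))
```

An empty result means no two interfaces on that node share a subnet, so the cause is more likely elsewhere.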
Assuming there are multiple interfaces, can you experiment with the runtime
flags outlined here?
http://www.open-mpi.org/faq/?category=tcp#tcp-selection
Maybe by restricting to specific interfaces you can figure out which network is
the problem.
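As a rough sketch of that experiment, the helper below assembles an mpirun command line that restricts the TCP BTL to named interfaces through the btl_tcp_if_include MCA parameter described in the FAQ. The host names, interface name, and program path are placeholders, and the helper itself is hypothetical. It assumes each host is listed once, so that np equal to the number of hosts gives one process per node.

```python
import shlex


def mpirun_with_interfaces(program, hosts, include):
    """Build an mpirun argument list that limits the TCP BTL to `include`."""
    return [
        "mpirun",
        "-np", str(len(hosts)),        # one process per listed host (assumed)
        "--host", ",".join(hosts),
        "--mca", "btl_tcp_if_include", ",".join(include),
        program,
    ]


# Placeholder hosts, interface, and program; substitute the real ones.
argv = mpirun_with_interfaces("./a.out", ["nodeA", "nodeB"], ["eth0"])
print(shlex.join(argv))
```

Running once per candidate interface and noting which runs hang should narrow down the problem network.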
Another cause of TCP hangs, if you are on Linux, is a configured virbr0
interface. The TCP BTL will incorrectly think that it can use the virbr
interfaces to communicate with other nodes. You either need to disable
the virbr interfaces or exclude them from being used by the TCP BTL.
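A minimal sketch of the exclusion route, assuming the btl_tcp_if_exclude MCA parameter from the FAQ: the hypothetical helper below picks every interface whose name starts with "virbr" and keeps the loopback interface in the list, since setting the parameter replaces its default value. The interface names are examples only.

```python
def tcp_exclude_value(interface_names, loopback="lo"):
    """Return a comma-separated value for btl_tcp_if_exclude.

    Always lists the loopback interface first, then any virbr* interfaces.
    """
    excluded = [loopback]
    excluded += sorted(n for n in interface_names if n.startswith("virbr"))
    return ",".join(excluded)


# Example interface names as they might appear in ifconfig output.
names = ["lo", "eth0", "virbr0"]
print("--mca btl_tcp_if_exclude " + tcp_exclude_value(names))
```

The printed option is meant to be added to the mpirun command line; it should not be combined with btl_tcp_if_include, as the two parameters are mutually exclusive.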
--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com