On Fri, 04 May 2012, Rolf vandeVaart wrote:
> On Behalf Of Don Armstrong
> > On Thu, 03 May 2012, Rolf vandeVaart wrote:
> > > 2. If that works, then you can also run with a debug switch to
> > > see what connections are being made by MPI.
> >
> > You can see the connections being made in the attached log:
> >
> > [archimedes:29820] btl: tcp: attempting to connect() to [[60576,1],2]
> > address 138.23.141.162 on port 2001
>
> Yes, I missed that. So, can we simplify the problem? Can you run
> with np=2 and one process on each node?
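Yes; same result, described below. For reference, the two-process run
was launched along these lines (the second hostname and the binary name
are placeholders here, and --mca btl_base_verbose 30 is the debug
switch that produced the connect() messages quoted above):

  mpirun -np 2 --host archimedes,node2 \
      --mca btl_base_verbose 30 \
      --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 \
      ./mpi_test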
It hangs in exactly the same spot, without ever completing the initial
sm-based message. (Specifically, the GUID send and ack appear to
complete on the tcp connection, but the actual traffic is never sent,
and the call to
ompi_request_wait_completion(&sendreq->req_send.req_base.req_ompi)
never returns.)

> Also, maybe you can send the ifconfig output from each node. We
> sometimes see this type of hanging when a node has two different
> interfaces on the same subnet.

Here is the ip addr output from each node.

First node (138.23.140.43):

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:30:48:7d:82:54 brd ff:ff:ff:ff:ff:ff
    inet 138.23.140.43/23 brd 138.23.141.255 scope global eth0
    inet 172.16.30.79/24 brd 172.16.30.255 scope global eth0:1
    inet6 fe80::230:48ff:fe7d:8254/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN qlen 1000
    link/ether 00:30:48:7d:82:55 brd ff:ff:ff:ff:ff:ff
    inet6 fd74:56b0:69d6::2101/118 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::230:48ff:fe7d:8255/64 scope link
       valid_lft forever preferred_lft forever
16: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 100
    link/none
    inet 10.134.0.6/24 brd 10.134.0.255 scope global tun0
17: tun1: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 100
    link/none
    inet 10.137.0.201/24 brd 10.137.0.255 scope global tun1

Second node (138.23.141.162):

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:17:a4:4b:7c:ea brd ff:ff:ff:ff:ff:ff
    inet 172.16.30.110/24 brd 172.16.30.255 scope global eth0:1
    inet 138.23.141.162/23 brd 138.23.141.255 scope global eth0
    inet6 fe80::217:a4ff:fe4b:7cea/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:17:a4:4b:7c:ec brd ff:ff:ff:ff:ff:ff
7: tun0: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 100
    link/none
    inet 10.134.0.26/24 brd 10.134.0.255 scope global tun0

> Assuming there are multiple interfaces, can you experiment with the
> runtime flags outlined here?
> http://www.open-mpi.org/faq/?category=tcp#tcp-selection

It's already running with btl_tcp_if_include=eth0 and
oob_tcp_if_include=eth0; the connections are happening only on eth0,
which carries the 138.23.140.0/23 addresses (138.23.140.43 and
138.23.141.162).

> Maybe by restricting to specific interfaces you can figure out which
> network is the problem.

The network doesn't appear to be the problem, unfortunately.

Don Armstrong

-- 
I'm So Meta, Even This Acronym
 -- xkcd http://xkcd.com/917/

http://www.donarmstrong.com              http://rzlab.ucr.edu
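P.S. For reference, the failing case is essentially just the first
blocking exchange between the two ranks. A minimal sketch of that
pattern follows (mpi_test.c is a hypothetical name; this shows the
shape of the test rather than the exact program in use):

/* mpi_test.c: minimal two-rank ping.  On the affected pair of nodes
 * the initial exchange never completes; the sender sits in
 * ompi_request_wait_completion() as described above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, buf = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* The send request on this first message is the one that
         * never completes on the affected nodes. */
        MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&buf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0: round trip completed\n");
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&buf, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc -o mpi_test mpi_test.c and run with the mpirun
line quoted earlier.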