It's supposed to, so it sounds like we have a bug in the connection failover 
mechanism. I'll address it.

On Jul 23, 2014, at 1:21 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> Thanks, Ralph!
> When I add --mca oob_tcp_if_include ib0 (where ib0 is the InfiniBand interface 
> from ifconfig) to mpirun, it starts working correctly! 
> Why doesn't Open MPI do this by itself?
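> 
> A side note in case it helps others: assuming a standard Open MPI install, the 
> setting doesn't have to go on every command line. Any MCA parameter can also 
> be made persistent: 
> 
> # per-user MCA parameter file, read by Open MPI at startup 
> $ echo "oob_tcp_if_include = ib0" >> $HOME/.openmpi/mca-params.conf 
> 
> # or exported as an environment variable before invoking mpirun 
> $ export OMPI_MCA_oob_tcp_if_include=ib0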
> 
> Tue, 22 Jul 2014 11:26:16 -0700 from Ralph Castain <r...@open-mpi.org>:
> Okay, the problem is that the connection back to mpirun isn't getting through. 
> We are trying on the 10.0.251.53 address - is that blocked, or should we be 
> using something else? If so, you might want to direct us by adding "-mca 
> oob_tcp_if_include foo", where foo is the interface you want us to use.
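> 
> For example, with ib0 as a placeholder for whatever interface carries a route 
> back to mpirun: 
> 
> $ mpirun -mca oob_tcp_if_include ib0 -np 2 hello_c 
> 
> If memory serves, the parameter also accepts CIDR subnets (e.g. "-mca 
> oob_tcp_if_include 10.0.0.0/16") if pinning an address range is easier.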
> 
> 
> On Jul 20, 2014, at 10:24 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
> 
>> NIC = network interface controller? 
>> 
>> The nodes have QDR InfiniBand 4x, 10G Ethernet, and Gigabit Ethernet. 
>> I want to use the QDR InfiniBand.
>> 
>> Here is a new output:
>> 
>> $ mpirun -mca mca_base_env_list 'LD_PRELOAD' --debug-daemons --mca 
>> plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 
>> hello_c |tee hello.out
>> Warning: Conflicting CPU frequencies detected, using: 2927.000000.
>> [compiler-2:30735] mca:base:select:( plm) Querying component [isolated]
>> [compiler-2:30735] mca:base:select:( plm) Query of component [isolated] set 
>> priority to 0
>> [compiler-2:30735] mca:base:select:( plm) Querying component [rsh]
>> [compiler-2:30735] mca:base:select:( plm) Query of component [rsh] set 
>> priority to 10
>> [compiler-2:30735] mca:base:select:( plm) Querying component [slurm]
>> [compiler-2:30735] mca:base:select:( plm) Query of component [slurm] set 
>> priority to 75
>> [compiler-2:30735] mca:base:select:( plm) Selected component [slurm]
>> [compiler-2:30735] mca: base: components_register: registering oob components
>> [compiler-2:30735] mca: base: components_register: found loaded component tcp
>> [compiler-2:30735] mca: base: components_register: component tcp register 
>> function successful
>> [compiler-2:30735] mca: base: components_open: opening oob components
>> [compiler-2:30735] mca: base: components_open: found loaded component tcp
>> [compiler-2:30735] mca: base: components_open: component tcp open function 
>> successful
>> [compiler-2:30735] mca:oob:select: checking available component tcp
>> [compiler-2:30735] mca:oob:select: Querying component [tcp]
>> [compiler-2:30735] oob:tcp: component_available called
>> [compiler-2:30735] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>> [compiler-2:30735] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>> [compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.0.251.53 to our list 
>> of V4 connections
>> [compiler-2:30735] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>> [compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.0.0.4 to our list of 
>> V4 connections
>> [compiler-2:30735] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>> [compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.2.251.14 to our list 
>> of V4 connections
>> [compiler-2:30735] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>> [compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.128.0.4 to our list 
>> of V4 connections
>> [compiler-2:30735] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>> [compiler-2:30735] [[65177,0],0] oob:tcp:init adding 93.180.7.38 to our list 
>> of V4 connections
>> [compiler-2:30735] [[65177,0],0] TCP STARTUP
>> [compiler-2:30735] [[65177,0],0] attempting to bind to IPv4 port 0
>> [compiler-2:30735] [[65177,0],0] assigned IPv4 port 49759
>> [compiler-2:30735] mca:oob:select: Adding component to end
>> [compiler-2:30735] mca:oob:select: Found 1 active transports
>> [compiler-2:30735] mca: base: components_register: registering rml components
>> [compiler-2:30735] mca: base: components_register: found loaded component oob
>> [compiler-2:30735] mca: base: components_register: component oob has no 
>> register or open function
>> [compiler-2:30735] mca: base: components_open: opening rml components
>> [compiler-2:30735] mca: base: components_open: found loaded component oob
>> [compiler-2:30735] mca: base: components_open: component oob open function 
>> successful
>> [compiler-2:30735] orte_rml_base_select: initializing rml component oob
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 30 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 15 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 32 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 33 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 5 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 10 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 12 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 9 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 34 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 2 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 21 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 22 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 45 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 46 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 1 for peer 
>> [[WILDCARD],WILDCARD]
>> [compiler-2:30735] [[65177,0],0] posting recv
>> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 27 for peer 
>> [[WILDCARD],WILDCARD]
>> Daemon was launched on node1-128-17 - beginning to initialize
>> Daemon was launched on node1-128-18 - beginning to initialize
>> [node1-128-17:14779] mca: base: components_register: registering oob 
>> components
>> [node1-128-17:14779] mca: base: components_register: found loaded component 
>> tcp
>> [node1-128-17:14779] mca: base: components_register: component tcp register 
>> function successful
>> [node1-128-17:14779] mca: base: components_open: opening oob components
>> [node1-128-17:14779] mca: base: components_open: found loaded component tcp
>> [node1-128-17:14779] mca: base: components_open: component tcp open function 
>> successful
>> [node1-128-17:14779] mca:oob:select: checking available component tcp
>> [node1-128-17:14779] mca:oob:select: Querying component [tcp]
>> [node1-128-17:14779] oob:tcp: component_available called
>> [node1-128-17:14779] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>> [node1-128-17:14779] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>> [node1-128-17:14779] [[65177,0],1] oob:tcp:init adding 10.0.128.17 to our 
>> list of V4 connections
>> [node1-128-17:14779] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>> [node1-128-17:14779] [[65177,0],1] oob:tcp:init adding 10.128.128.17 to our 
>> list of V4 connections
>> [node1-128-17:14779] [[65177,0],1] TCP STARTUP
>> [node1-128-17:14779] [[65177,0],1] attempting to bind to IPv4 port 0
>> [node1-128-17:14779] [[65177,0],1] assigned IPv4 port 46441
>> [node1-128-17:14779] mca:oob:select: Adding component to end
>> [node1-128-17:14779] mca:oob:select: Found 1 active transports
>> [node1-128-17:14779] mca: base: components_register: registering rml 
>> components
>> [node1-128-17:14779] mca: base: components_register: found loaded component 
>> oob
>> [node1-128-17:14779] mca: base: components_register: component oob has no 
>> register or open function
>> [node1-128-17:14779] mca: base: components_open: opening rml components
>> [node1-128-17:14779] mca: base: components_open: found loaded component oob
>> [node1-128-17:14779] mca: base: components_open: component oob open function 
>> successful
>> [node1-128-17:14779] orte_rml_base_select: initializing rml component oob
>> [node1-128-18:17849] mca: base: components_register: registering oob 
>> components
>> [node1-128-18:17849] mca: base: components_register: found loaded component 
>> tcp
>> [node1-128-18:17849] mca: base: components_register: component tcp register 
>> function successful
>> [node1-128-18:17849] mca: base: components_open: opening oob components
>> [node1-128-18:17849] mca: base: components_open: found loaded component tcp
>> [node1-128-18:17849] mca: base: components_open: component tcp open function 
>> successful
>> [node1-128-18:17849] mca:oob:select: checking available component tcp
>> [node1-128-18:17849] mca:oob:select: Querying component [tcp]
>> [node1-128-18:17849] oob:tcp: component_available called
>> [node1-128-18:17849] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>> [node1-128-18:17849] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>> [node1-128-18:17849] [[65177,0],2] oob:tcp:init adding 10.0.128.18 to our 
>> list of V4 connections
>> [node1-128-18:17849] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>> [node1-128-18:17849] [[65177,0],2] oob:tcp:init adding 10.128.128.18 to our 
>> list of V4 connections
>> [node1-128-18:17849] [[65177,0],2] TCP STARTUP
>> [node1-128-18:17849] [[65177,0],2] attempting to bind to IPv4 port 0
>> [node1-128-18:17849] [[65177,0],2] assigned IPv4 port 60695
>> [node1-128-18:17849] mca:oob:select: Adding component to end
>> [node1-128-18:17849] mca:oob:select: Found 1 active transports
>> [node1-128-18:17849] mca: base: components_register: registering rml 
>> components
>> [node1-128-18:17849] mca: base: components_register: found loaded component 
>> oob
>> [node1-128-18:17849] mca: base: components_register: component oob has no 
>> register or open function
>> [node1-128-18:17849] mca: base: components_open: opening rml components
>> [node1-128-18:17849] mca: base: components_open: found loaded component oob
>> [node1-128-18:17849] mca: base: components_open: component oob open function 
>> successful
>> [node1-128-18:17849] orte_rml_base_select: initializing rml component oob
>> Daemon [[65177,0],1] checking in as pid 14779 on host node1-128-17
>> [node1-128-17:14779] [[65177,0],1] orted: up and running - waiting for 
>> commands!
>> [node1-128-17:14779] [[65177,0],1] posting recv
>> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 30 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-17:14779] [[65177,0],1] posting recv
>> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 15 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-17:14779] [[65177,0],1] posting recv
>> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 32 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-17:14779] [[65177,0],1] posting recv
>> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 11 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-17:14779] [[65177,0],1] posting recv
>> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 9 for peer 
>> [[WILDCARD],WILDCARD]
>> [node1-128-17:14779] [[65177,0],1]: set_addr to uri 
>> 4271439872.0;tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759
>> [node1-128-17:14779] [[65177,0],1]:set_addr checking if peer [[65177,0],0] 
>> is reachable via component tcp
>> [node1-128-17:14779] [[65177,0],1] oob:tcp: working peer [[65177,0],0] 
>> address tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759
>> [node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.0.251.53 TO MODULE
>> [node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.0.0.4 TO MODULE
>> [node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.2.251.14 TO MODULE
>> [node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.128.0.4 TO MODULE
>> [node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1] PASSING ADDR 93.180.7.38 TO MODULE
>> [node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1]: peer [[65177,0],0] is reachable via 
>> component tcp
>> [node1-128-17:14779] [[65177,0],1] posting recv
>> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 3 for peer 
>> [[WILDCARD],WILDCARD]
>> [node1-128-17:14779] [[65177,0],1] posting recv
>> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 21 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-17:14779] [[65177,0],1] posting recv
>> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 45 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-17:14779] [[65177,0],1] posting recv
>> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 46 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-17:14779] [[65177,0],1] posting recv
>> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 1 for peer 
>> [[WILDCARD],WILDCARD]
>> [node1-128-17:14779] [[65177,0],1] OOB_SEND: rml_oob_send.c:199
>> [node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd
>> [node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd
>> [node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd
>> [node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd
>> [node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd
>> [node1-128-17:14779] [[65177,0],1] oob:base:send to target [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1] oob:tcp:send_nb to peer [[65177,0],0]:10
>> [node1-128-17:14779] [[65177,0],1] tcp:send_nb to peer [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:484] post send to [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:421] processing send to peer 
>> [[65177,0],0]:10
>> [node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:455] queue pending to 
>> [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1] tcp:send_nb: initiating connection to 
>> [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:469] connect to [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1] orte_tcp_peer_try_connect: attempting to 
>> connect to proc [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1] orte_tcp_peer_try_connect: attempting to 
>> connect to proc [[65177,0],0] on socket 10
>> [node1-128-17:14779] [[65177,0],1] orte_tcp_peer_try_connect: attempting to 
>> connect to proc [[65177,0],0] on 10.0.251.53:49759 - 0 retries
>> [node1-128-17:14779] [[65177,0],1] waiting for connect completion to 
>> [[65177,0],0] - activating send event
>> Daemon [[65177,0],2] checking in as pid 17849 on host node1-128-18
>> [node1-128-18:17849] [[65177,0],2] orted: up and running - waiting for 
>> commands!
>> [node1-128-18:17849] [[65177,0],2] posting recv
>> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 30 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-18:17849] [[65177,0],2] posting recv
>> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 15 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-18:17849] [[65177,0],2] posting recv
>> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 32 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-18:17849] [[65177,0],2] posting recv
>> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 11 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-18:17849] [[65177,0],2] posting recv
>> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 9 for peer 
>> [[WILDCARD],WILDCARD]
>> [node1-128-18:17849] [[65177,0],2]: set_addr to uri 
>> 4271439872.0;tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759
>> [node1-128-18:17849] [[65177,0],2]:set_addr checking if peer [[65177,0],0] 
>> is reachable via component tcp
>> [node1-128-18:17849] [[65177,0],2] oob:tcp: working peer [[65177,0],0] 
>> address tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759
>> [node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.0.251.53 TO MODULE
>> [node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.0.0.4 TO MODULE
>> [node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.2.251.14 TO MODULE
>> [node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.128.0.4 TO MODULE
>> [node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2] PASSING ADDR 93.180.7.38 TO MODULE
>> [node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2]: peer [[65177,0],0] is reachable via 
>> component tcp
>> [node1-128-18:17849] [[65177,0],2] posting recv
>> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 3 for peer 
>> [[WILDCARD],WILDCARD]
>> [node1-128-18:17849] [[65177,0],2] posting recv
>> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 21 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-18:17849] [[65177,0],2] posting recv
>> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 45 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-18:17849] [[65177,0],2] posting recv
>> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 46 for 
>> peer [[WILDCARD],WILDCARD]
>> [node1-128-18:17849] [[65177,0],2] posting recv
>> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 1 for peer 
>> [[WILDCARD],WILDCARD]
>> [node1-128-18:17849] [[65177,0],2] OOB_SEND: rml_oob_send.c:199
>> [node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd
>> [node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd
>> [node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd
>> [node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd
>> [node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd
>> [node1-128-18:17849] [[65177,0],2] oob:base:send to target [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2] oob:tcp:send_nb to peer [[65177,0],0]:10
>> [node1-128-18:17849] [[65177,0],2] tcp:send_nb to peer [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:484] post send to [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:421] processing send to peer 
>> [[65177,0],0]:10
>> [node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:455] queue pending to 
>> [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2] tcp:send_nb: initiating connection to 
>> [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:469] connect to [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2] orte_tcp_peer_try_connect: attempting to 
>> connect to proc [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2] orte_tcp_peer_try_connect: attempting to 
>> connect to proc [[65177,0],0] on socket 10
>> [node1-128-18:17849] [[65177,0],2] orte_tcp_peer_try_connect: attempting to 
>> connect to proc [[65177,0],0] on 10.0.251.53:49759 - 0 retries
>> [node1-128-18:17849] [[65177,0],2] waiting for connect completion to 
>> [[65177,0],0] - activating send event
>> [node1-128-18:17837] [[61806,0],2] tcp:send_handler called to send to peer 
>> [[61806,0],0]
>> [node1-128-18:17837] [[61806,0],2] tcp:send_handler CONNECTING
>> [node1-128-18:17837] [[61806,0],2]:tcp:complete_connect called for peer 
>> [[61806,0],0] on socket 10
>> [node1-128-18:17837] [[61806,0],2]-[[61806,0],0] tcp_peer_complete_connect: 
>> connection failed: Connection timed out (110)
>> [node1-128-18:17837] [[61806,0],2] tcp_peer_close for [[61806,0],0] sd 10 
>> state CONNECTING
>> [node1-128-18:17837] [[61806,0],2] tcp:lost connection called for peer 
>> [[61806,0],0]
>> [node1-128-18:17837] mca: base: close: component oob closed
>> [node1-128-18:17837] mca: base: close: unloading component oob
>> [node1-128-18:17837] [[61806,0],2] TCP SHUTDOWN
>> [node1-128-18:17837] [[61806,0],2] RELEASING PEER OBJ [[61806,0],0]
>> [node1-128-18:17837] [[61806,0],2] CLOSING SOCKET 10
>> [node1-128-18:17837] mca: base: close: component tcp closed
>> [node1-128-18:17837] mca: base: close: unloading component tcp
>> srun: error: node1-128-18: task 1: Exited with exit code 1
>> srun: Terminating job step 647191.1
>> [node1-128-17:14767] [[61806,0],1] tcp:send_handler called to send to peer 
>> [[61806,0],0]
>> [node1-128-17:14767] [[61806,0],1] tcp:send_handler CONNECTING
>> [node1-128-17:14767] [[61806,0],1]:tcp:complete_connect called for peer 
>> [[61806,0],0] on socket 10
>> [node1-128-17:14767] [[61806,0],1]-[[61806,0],0] tcp_peer_complete_connect: 
>> connection failed: Connection timed out (110)
>> [node1-128-17:14767] [[61806,0],1] tcp_peer_close for [[61806,0],0] sd 10 
>> state CONNECTING
>> [node1-128-17:14767] [[61806,0],1] tcp:lost connection called for peer 
>> [[61806,0],0]
>> [node1-128-17:14767] mca: base: close: component oob closed
>> [node1-128-17:14767] mca: base: close: unloading component oob
>> [node1-128-17:14767] [[61806,0],1] TCP SHUTDOWN
>> [node1-128-17:14767] [[61806,0],1] RELEASING PEER OBJ [[61806,0],0]
>> [node1-128-17:14767] [[61806,0],1] CLOSING SOCKET 10
>> [node1-128-17:14767] mca: base: close: component tcp closed
>> [node1-128-17:14767] mca: base: close: unloading component tcp
>> srun: error: node1-128-17: task 0: Exited with exit code 1
>> [node1-128-17:14779] [[65177,0],1] tcp:send_handler called to send to peer 
>> [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1] tcp:send_handler CONNECTING
>> [node1-128-17:14779] [[65177,0],1]:tcp:complete_connect called for peer 
>> [[65177,0],0] on socket 10
>> [node1-128-17:14779] [[65177,0],1]-[[65177,0],0] tcp_peer_complete_connect: 
>> connection failed: Connection timed out (110)
>> [node1-128-17:14779] [[65177,0],1] tcp_peer_close for [[65177,0],0] sd 10 
>> state CONNECTING
>> [node1-128-17:14779] [[65177,0],1] tcp:lost connection called for peer 
>> [[65177,0],0]
>> [node1-128-17:14779] mca: base: close: component oob closed
>> [node1-128-17:14779] mca: base: close: unloading component oob
>> [node1-128-17:14779] [[65177,0],1] TCP SHUTDOWN
>> [node1-128-17:14779] [[65177,0],1] RELEASING PEER OBJ [[65177,0],0]
>> [node1-128-17:14779] [[65177,0],1] CLOSING SOCKET 10
>> [node1-128-17:14779] mca: base: close: component tcp closed
>> [node1-128-17:14779] mca: base: close: unloading component tcp
>> [node1-128-18:17849] [[65177,0],2] tcp:send_handler called to send to peer 
>> [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2] tcp:send_handler CONNECTING
>> [node1-128-18:17849] [[65177,0],2]:tcp:complete_connect called for peer 
>> [[65177,0],0] on socket 10
>> [node1-128-18:17849] [[65177,0],2]-[[65177,0],0] tcp_peer_complete_connect: 
>> connection failed: Connection timed out (110)
>> [node1-128-18:17849] [[65177,0],2] tcp_peer_close for [[65177,0],0] sd 10 
>> state CONNECTING
>> [node1-128-18:17849] [[65177,0],2] tcp:lost connection called for peer 
>> [[65177,0],0]
>> [node1-128-18:17849] mca: base: close: component oob closed
>> [node1-128-18:17849] mca: base: close: unloading component oob
>> [node1-128-18:17849] [[65177,0],2] TCP SHUTDOWN
>> [node1-128-18:17849] [[65177,0],2] RELEASING PEER OBJ [[65177,0],0]
>> [node1-128-18:17849] [[65177,0],2] CLOSING SOCKET 10
>> [node1-128-18:17849] mca: base: close: component tcp closed
>> [node1-128-18:17849] mca: base: close: unloading component tcp
>> srun: error: node1-128-17: task 0: Exited with exit code 1
>> srun: Terminating job step 647191.2
>> srun: error: node1-128-18: task 1: Exited with exit code 1
>> --------------------------------------------------------------------------
>> An ORTE daemon has unexpectedly failed after launch and before
>> communicating back to mpirun. This could be caused by a number
>> of factors, including an inability to create a connection back
>> to mpirun due to a lack of common network interfaces and/or no
>> route found between them. Please check network connectivity
>> (including firewalls and network routing requirements).
>> --------------------------------------------------------------------------
>> [compiler-2:30735] [[65177,0],0] orted_cmd: received halt_vm cmd
>> [compiler-2:30735] mca: base: close: component oob closed
>> [compiler-2:30735] mca: base: close: unloading component oob
>> [compiler-2:30735] [[65177,0],0] TCP SHUTDOWN
>> [compiler-2:30735] mca: base: close: component tcp closed
>> [compiler-2:30735] mca: base: close: unloading component tcp
>> 
>> 
>> 
>> 
>> Sun, 20 Jul 2014 13:11:19 -0700 from Ralph Castain <r...@open-mpi.org>:
>> Yeah, we aren't connecting back - is there a firewall running? You need to 
>> leave the "--debug-daemons --mca plm_base_verbose 5" flags on there as well 
>> to see the entire problem.
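>> 
>> Something like this, reusing your earlier command: 
>> 
>> $ mpirun --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 \ 
>>     -mca rml_base_verbose 10 -np 2 hello_c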
>> 
>> What you can see here is that mpirun is listening on several interfaces:
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list 
>>> of V4 connections
>>> 
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list 
>>> of V4 connections
>>> 
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of 
>>> V4 connections
>>> 
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.128.0.1 to our list of 
>>> V4 connections
>>> 
>> 
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list 
>>> of V4 connections
>>> 
>> 
>> It looks like you have multiple interfaces connected to the same subnet - 
>> this is generally a bad idea. I also saw that the last one in the list shows 
>> up twice in the kernel array - not sure why, but is there something special 
>> about that NIC?
>> 
>> What do the NICs look like on the remote hosts?
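>> 
>> A quick way to check, run inside the allocation so it lands on a compute node 
>> (exact tool depends on what the nodes have installed): 
>> 
>> $ srun -N1 /sbin/ip -4 addr show 
>> 
>> or 
>> 
>> $ srun -N1 /sbin/ifconfig -a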
>> 
>> On Jul 20, 2014, at 10:59 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>> 
>>> 
>>> 
>>> 
>>> -------- Forwarded message --------
>>> From: Timur Ismagilov <tismagi...@mail.ru>
>>> To: Ralph Castain <r...@open-mpi.org>
>>> Date: Sun, 20 Jul 2014 21:58:41 +0400
>>> Subject: Re[2]: [OMPI users] Fwd: Re[4]: Salloc and mpirun problem
>>> 
>>> Here it is:
>>> 
>>> $ salloc -N2 --exclusive -p test -J ompi
>>> salloc: Granted job allocation 647049
>>> 
>>> 
>>> $ mpirun -mca mca_base_env_list 'LD_PRELOAD' -mca oob_base_verbose 10 -mca 
>>> rml_base_verbose 10 -np 2 hello_c
>>> 
>>> [access1:24264] mca: base: components_register: registering oob components
>>> [access1:24264] mca: base: components_register: found loaded component tcp
>>> [access1:24264] mca: base: components_register: component tcp register 
>>> function successful
>>> [access1:24264] mca: base: components_open: opening oob components
>>> [access1:24264] mca: base: components_open: found loaded component tcp
>>> [access1:24264] mca: base: components_open: component tcp open function 
>>> successful
>>> [access1:24264] mca:oob:select: checking available component tcp
>>> [access1:24264] mca:oob:select: Querying component [tcp]
>>> [access1:24264] oob:tcp: component_available called
>>> [access1:24264] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>> [access1:24264] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list 
>>> of V4 connections
>>> [access1:24264] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of 
>>> V4 connections
>>> [access1:24264] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list 
>>> of V4 connections
>>> [access1:24264] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.128.0.1 to our list of 
>>> V4 connections
>>> [access1:24264] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list 
>>> of V4 connections
>>> [access1:24264] WORKING INTERFACE 7 KERNEL INDEX 7 FAMILY: V4
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list 
>>> of V4 connections
>>> [access1:24264] [[55095,0],0] TCP STARTUP
>>> [access1:24264] [[55095,0],0] attempting to bind to IPv4 port 0
>>> [access1:24264] [[55095,0],0] assigned IPv4 port 47756
>>> [access1:24264] mca:oob:select: Adding component to end
>>> [access1:24264] mca:oob:select: Found 1 active transports
>>> [access1:24264] mca: base: components_register: registering rml components
>>> [access1:24264] mca: base: components_register: found loaded component oob
>>> [access1:24264] mca: base: components_register: component oob has no 
>>> register or open function
>>> [access1:24264] mca: base: components_open: opening rml components
>>> [access1:24264] mca: base: components_open: found loaded component oob
>>> [access1:24264] mca: base: components_open: component oob open function 
>>> successful
>>> [access1:24264] orte_rml_base_select: initializing rml component oob
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 30 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 15 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 32 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 33 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 5 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 10 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 12 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 9 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 34 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 2 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 21 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 22 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 45 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 46 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 1 for peer 
>>> [[WILDCARD],WILDCARD]
>>> [access1:24264] [[55095,0],0] posting recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 27 for peer 
>>> [[WILDCARD],WILDCARD]
>>> --------------------------------------------------------------------------
>>> An ORTE daemon has unexpectedly failed after launch and before
>>> communicating back to mpirun. This could be caused by a number
>>> of factors, including an inability to create a connection back
>>> to mpirun due to a lack of common network interfaces and/or no
>>> route found between them. Please check network connectivity
>>> (including firewalls and network routing requirements).
>>> --------------------------------------------------------------------------
>>> [access1:24264] mca: base: close: component oob closed
>>> [access1:24264] mca: base: close: unloading component oob
>>> [access1:24264] [[55095,0],0] TCP SHUTDOWN
>>> [access1:24264] mca: base: close: component tcp closed
>>> [access1:24264] mca: base: close: unloading component tcp
>>> 
>>> When I use srun, I get:
>>> 
>>> $ salloc -N2 --exclusive -p test -J ompi
>>> ....
>>> $srun -N 2 ./hello_c
>>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>> semenov@compiler-2 Distribution, ident: 1.9a1r32252, repo rev: r32252, Jul 
>>> 16, 2014 (nightly snapshot tarball), 146)
>>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>> semenov@compiler-2 Distribution, ident: 1.9a1r32252, repo rev: r32252, Jul 
>>> 16, 2014 (nightly snapshot tarball), 146)
>>> 
>>> 
>>> Sun, 20 Jul 2014 09:28:13 -0700 from Ralph Castain <r...@open-mpi.org>:
>>> 
>>> Try adding -mca oob_base_verbose 10 -mca rml_base_verbose 10 to your cmd 
>>> line. It looks to me like we are unable to connect back to the node where 
>>> you are running mpirun for some reason.
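>>> 
>>> i.e. roughly this, with your earlier command as the base: 
>>> 
>>> $ mpirun -mca mca_base_env_list 'LD_PRELOAD' -mca oob_base_verbose 10 \ 
>>>     -mca rml_base_verbose 10 -np 2 hello_c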
>>> 
>>> 
>>> On Jul 20, 2014, at 9:16 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>> 
>>>> I have the same problem with Open MPI 1.8.1 (Apr 23, 2014).
>>>> Does the srun command have an equivalent of mpirun's --map-by <foo> 
>>>> parameter, or can I change it from the bash environment?
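>>>> 
>>>> (If I understand the docs correctly, any mpirun MCA parameter can be set 
>>>> from the environment as OMPI_MCA_<name>; the parameter behind --map-by in 
>>>> the 1.8 series should be rmaps_base_mapping_policy, e.g.: 
>>>> 
>>>> $ export OMPI_MCA_rmaps_base_mapping_policy=socket 
>>>> 
>>>> Jobs launched directly with srun are placed by SLURM itself, e.g. via 
>>>> srun's --distribution option.)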
>>>> 
>>>> 
>>>> 
>>>> -------- Forwarded message --------
>>>> From: Timur Ismagilov <tismagi...@mail.ru>
>>>> To: Mike Dubman <mi...@dev.mellanox.co.il>
>>>> Cc: Open MPI Users <us...@open-mpi.org>
>>>> Date: Thu, 17 Jul 2014 16:42:24 +0400
>>>> Subject: Re[4]: [OMPI users] Salloc and mpirun problem
>>>> 
>>>> 
>>>> With Open MPI 1.9a1r32252 (Jul 16, 2014, nightly snapshot tarball) I get 
>>>> this output (the same?):
>>>> 
>>>> $ salloc -N2 --exclusive -p test -J ompi
>>>> salloc: Granted job allocation 645686
>>>> 
>>>> $LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>>>>   mpirun  -mca mca_base_env_list 'LD_PRELOAD'  --mca plm_base_verbose 10 
>>>> --debug-daemons -np 1 hello_c
>>>> 
>>>> [access1:04312] mca: base: components_register: registering plm components
>>>> [access1:04312] mca: base: components_register: found loaded component 
>>>> isolated
>>>> [access1:04312] mca: base: components_register: component isolated has no 
>>>> register or open function
>>>> [access1:04312] mca: base: components_register: found loaded component rsh
>>>> [access1:04312] mca: base: components_register: component rsh register 
>>>> function successful
>>>> [access1:04312] mca: base: components_register: found loaded component 
>>>> slurm
>>>> [access1:04312] mca: base: components_register: component slurm register 
>>>> function successful
>>>> [access1:04312] mca: base: components_open: opening plm components
>>>> [access1:04312] mca: base: components_open: found loaded component isolated
>>>> [access1:04312] mca: base: components_open: component isolated open 
>>>> function successful
>>>> [access1:04312] mca: base: components_open: found loaded component rsh
>>>> [access1:04312] mca: base: components_open: component rsh open function 
>>>> successful
>>>> [access1:04312] mca: base: components_open: found loaded component slurm
>>>> [access1:04312] mca: base: components_open: component slurm open function 
>>>> successful
>>>> [access1:04312] mca:base:select: Auto-selecting plm components
>>>> [access1:04312] mca:base:select:( plm) Querying component [isolated]
>>>> [access1:04312] mca:base:select:( plm) Query of component [isolated] set 
>>>> priority to 0
>>>> [access1:04312] mca:base:select:( plm) Querying component [rsh]
>>>> [access1:04312] mca:base:select:( plm) Query of component [rsh] set 
>>>> priority to 10
>>>> [access1:04312] mca:base:select:( plm) Querying component [slurm]
>>>> [access1:04312] mca:base:select:( plm) Query of component [slurm] set 
>>>> priority to 75
>>>> [access1:04312] mca:base:select:( plm) Selected component [slurm]
>>>> [access1:04312] mca: base: close: component isolated closed
>>>> [access1:04312] mca: base: close: unloading component isolated
>>>> [access1:04312] mca: base: close: component rsh closed
>>>> [access1:04312] mca: base: close: unloading component rsh
>>>> Daemon was launched on node1-128-09 - beginning to initialize
>>>> Daemon was launched on node1-128-15 - beginning to initialize
>>>> Daemon [[39207,0],1] checking in as pid 26240 on host node1-128-09
>>>> [node1-128-09:26240] [[39207,0],1] orted: up and running - waiting for 
>>>> commands!
>>>> Daemon [[39207,0],2] checking in as pid 30129 on host node1-128-15
>>>> [node1-128-15:30129] [[39207,0],2] orted: up and running - waiting for 
>>>> commands!
>>>> srun: error: node1-128-09: task 0: Exited with exit code 1
>>>> srun: Terminating job step 645686.3
>>>> srun: error: node1-128-15: task 1: Exited with exit code 1
>>>> --------------------------------------------------------------------------
>>>> An ORTE daemon has unexpectedly failed after launch and before
>>>> communicating back to mpirun. This could be caused by a number
>>>> of factors, including an inability to create a connection back
>>>> to mpirun due to a lack of common network interfaces and/or no
>>>> route found between them. Please check network connectivity
>>>> (including firewalls and network routing requirements).
>>>> --------------------------------------------------------------------------
>>>> [access1:04312] [[39207,0],0] orted_cmd: received halt_vm cmd
>>>> [access1:04312] mca: base: close: component slurm closed
>>>> [access1:04312] mca: base: close: unloading component slurm
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 