It's supposed to do that automatically, so it sounds like we have a bug in the connection failover mechanism. I'll address it.
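For readers who land on this thread with the same symptom: the interface restriction that resolved it here does not have to be retyped on every mpirun invocation. Below is a minimal sketch, assuming the IPoIB interface is named ib0 as in Timur's report; the environment-variable and per-user mca-params.conf forms are the standard Open MPI mechanisms for setting MCA parameters, not something shown in this thread.

# On the mpirun command line, as used in this thread:
$ mpirun --mca oob_tcp_if_include ib0 -np 2 hello_c

# Or via the environment, so every mpirun in the session picks it up:
$ export OMPI_MCA_oob_tcp_if_include=ib0
$ mpirun -np 2 hello_c

# Or persistently, in the per-user MCA parameter file:
$ echo "oob_tcp_if_include = ib0" >> $HOME/.openmpi/mca-params.conf
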
On Jul 23, 2014, at 1:21 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> Thanks, Ralph!
> When I add --mca oob_tcp_if_include ib0 (where ib0 is the InfiniBand interface from ifconfig) to mpirun, it starts working correctly!
> Why doesn't Open MPI do this by itself?
>
> Tue, 22 Jul 2014 11:26:16 -0700, from Ralph Castain <r...@open-mpi.org>:
> Okay, the problem is that the connection back to mpirun isn't getting through. We are trying on the 10.0.251.53 address - is that blocked, or should we be using something else? If so, you might want to direct us by adding "-mca oob_tcp_if_include foo", where foo is the interface you want us to use.
>
> On Jul 20, 2014, at 10:24 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
>> NIC = network interface controller?
>>
>> There is QDR InfiniBand 4x / 10G Ethernet / Gigabit Ethernet.
>> I want to use the QDR InfiniBand.
>>
>> Here is the new output:
>>
>> $ mpirun -mca mca_base_env_list 'LD_PRELOAD' --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c | tee hello.out
>> Warning: Conflicting CPU frequencies detected, using: 2927.000000.
>> [compiler-2:30735] mca:base:select:( plm) Querying component [isolated]
>> [compiler-2:30735] mca:base:select:( plm) Query of component [isolated] set priority to 0
>> [compiler-2:30735] mca:base:select:( plm) Querying component [rsh]
>> [compiler-2:30735] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [compiler-2:30735] mca:base:select:( plm) Querying component [slurm]
>> [compiler-2:30735] mca:base:select:( plm) Query of component [slurm] set priority to 75
>> [compiler-2:30735] mca:base:select:( plm) Selected component [slurm]
>> [compiler-2:30735] mca: base: components_register: registering oob components
>> [compiler-2:30735] mca: base: components_register: found loaded component tcp
>> [compiler-2:30735] mca: base: components_register: component tcp register function successful
>> [compiler-2:30735] mca: base: components_open: opening oob components
>> [compiler-2:30735] mca: base: components_open: found loaded component tcp
>> [compiler-2:30735] mca: base: components_open: component tcp open function successful
>> [compiler-2:30735] mca:oob:select: checking available component tcp
>> [compiler-2:30735] mca:oob:select: Querying component [tcp]
>> [compiler-2:30735] oob:tcp: component_available called
>> [compiler-2:30735] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>> [compiler-2:30735] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>> [compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.0.251.53 to our list of V4 connections
>> [compiler-2:30735] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>> [compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.0.0.4 to our list of V4 connections
>> [compiler-2:30735] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>> [compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.2.251.14 to our list of V4 connections
>> [compiler-2:30735] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>> [compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
>> [compiler-2:30735] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>> [compiler-2:30735] [[65177,0],0] oob:tcp:init adding 93.180.7.38 to our list of V4 connections
>> [compiler-2:30735] [[65177,0],0] TCP STARTUP
>> [compiler-2:30735] [[65177,0],0] attempting to bind to IPv4 port 0
>> [compiler-2:30735] [[65177,0],0] assigned IPv4 port 49759
>> [compiler-2:30735] mca:oob:select: Adding component to end
[compiler-2:30735] mca:oob:select: Found 1 active transports >> [compiler-2:30735] mca: base: components_register: registering rml components >> [compiler-2:30735] mca: base: components_register: found loaded component oob >> [compiler-2:30735] mca: base: components_register: component oob has no >> register or open function >> [compiler-2:30735] mca: base: components_open: opening rml components >> [compiler-2:30735] mca: base: components_open: found loaded component oob >> [compiler-2:30735] mca: base: components_open: component oob open function >> successful >> [compiler-2:30735] orte_rml_base_select: initializing rml component oob >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 30 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 15 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 32 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 33 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 5 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 10 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 12 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 9 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 34 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 2 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 21 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 22 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 45 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 46 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 1 for peer >> [[WILDCARD],WILDCARD] >> [compiler-2:30735] [[65177,0],0] posting recv >> [compiler-2:30735] [[65177,0],0] posting persistent recv on tag 27 for peer >> [[WILDCARD],WILDCARD] >> Daemon was launched on node1-128-17 - beginning to initialize >> Daemon was launched on node1-128-18 - beginning to initialize >> [node1-128-17:14779] mca: base: components_register: registering oob >> components >> [node1-128-17:14779] mca: base: components_register: found loaded component >> tcp >> [node1-128-17:14779] mca: base: components_register: component tcp register >> function successful >> [node1-128-17:14779] mca: base: components_open: opening 
oob components >> [node1-128-17:14779] mca: base: components_open: found loaded component tcp >> [node1-128-17:14779] mca: base: components_open: component tcp open function >> successful >> [node1-128-17:14779] mca:oob:select: checking available component tcp >> [node1-128-17:14779] mca:oob:select: Querying component [tcp] >> [node1-128-17:14779] oob:tcp: component_available called >> [node1-128-17:14779] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >> [node1-128-17:14779] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >> [node1-128-17:14779] [[65177,0],1] oob:tcp:init adding 10.0.128.17 to our >> list of V4 connections >> [node1-128-17:14779] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >> [node1-128-17:14779] [[65177,0],1] oob:tcp:init adding 10.128.128.17 to our >> list of V4 connections >> [node1-128-17:14779] [[65177,0],1] TCP STARTUP >> [node1-128-17:14779] [[65177,0],1] attempting to bind to IPv4 port 0 >> [node1-128-17:14779] [[65177,0],1] assigned IPv4 port 46441 >> [node1-128-17:14779] mca:oob:select: Adding component to end >> [node1-128-17:14779] mca:oob:select: Found 1 active transports >> [node1-128-17:14779] mca: base: components_register: registering rml >> components >> [node1-128-17:14779] mca: base: components_register: found loaded component >> oob >> [node1-128-17:14779] mca: base: components_register: component oob has no >> register or open function >> [node1-128-17:14779] mca: base: components_open: opening rml components >> [node1-128-17:14779] mca: base: components_open: found loaded component oob >> [node1-128-17:14779] mca: base: components_open: component oob open function >> successful >> [node1-128-17:14779] orte_rml_base_select: initializing rml component oob >> [node1-128-18:17849] mca: base: components_register: registering oob >> components >> [node1-128-18:17849] mca: base: components_register: found loaded component >> tcp >> [node1-128-18:17849] mca: base: components_register: component tcp register >> function successful >> [node1-128-18:17849] mca: base: components_open: opening oob components >> [node1-128-18:17849] mca: base: components_open: found loaded component tcp >> [node1-128-18:17849] mca: base: components_open: component tcp open function >> successful >> [node1-128-18:17849] mca:oob:select: checking available component tcp >> [node1-128-18:17849] mca:oob:select: Querying component [tcp] >> [node1-128-18:17849] oob:tcp: component_available called >> [node1-128-18:17849] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4 >> [node1-128-18:17849] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4 >> [node1-128-18:17849] [[65177,0],2] oob:tcp:init adding 10.0.128.18 to our >> list of V4 connections >> [node1-128-18:17849] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4 >> [node1-128-18:17849] [[65177,0],2] oob:tcp:init adding 10.128.128.18 to our >> list of V4 connections >> [node1-128-18:17849] [[65177,0],2] TCP STARTUP >> [node1-128-18:17849] [[65177,0],2] attempting to bind to IPv4 port 0 >> [node1-128-18:17849] [[65177,0],2] assigned IPv4 port 60695 >> [node1-128-18:17849] mca:oob:select: Adding component to end >> [node1-128-18:17849] mca:oob:select: Found 1 active transports >> [node1-128-18:17849] mca: base: components_register: registering rml >> components >> [node1-128-18:17849] mca: base: components_register: found loaded component >> oob >> [node1-128-18:17849] mca: base: components_register: component oob has no >> register or open function >> [node1-128-18:17849] mca: base: components_open: opening rml components >> [node1-128-18:17849] 
mca: base: components_open: found loaded component oob >> [node1-128-18:17849] mca: base: components_open: component oob open function >> successful >> [node1-128-18:17849] orte_rml_base_select: initializing rml component oob >> Daemon [[65177,0],1] checking in as pid 14779 on host node1-128-17 >> [node1-128-17:14779] [[65177,0],1] orted: up and running - waiting for >> commands! >> [node1-128-17:14779] [[65177,0],1] posting recv >> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 30 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-17:14779] [[65177,0],1] posting recv >> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 15 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-17:14779] [[65177,0],1] posting recv >> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 32 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-17:14779] [[65177,0],1] posting recv >> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 11 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-17:14779] [[65177,0],1] posting recv >> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 9 for peer >> [[WILDCARD],WILDCARD] >> [node1-128-17:14779] [[65177,0],1]: set_addr to uri >> 4271439872.0;tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759 >> [node1-128-17:14779] [[65177,0],1]:set_addr checking if peer [[65177,0],0] >> is reachable via component tcp >> [node1-128-17:14779] [[65177,0],1] oob:tcp: working peer [[65177,0],0] >> address tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759 >> [node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.0.251.53 TO MODULE >> [node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.0.0.4 TO MODULE >> [node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.2.251.14 TO MODULE >> [node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.128.0.4 TO MODULE >> [node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1] PASSING ADDR 93.180.7.38 TO MODULE >> [node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1]: peer [[65177,0],0] is reachable via >> component tcp >> [node1-128-17:14779] [[65177,0],1] posting recv >> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 3 for peer >> [[WILDCARD],WILDCARD] >> [node1-128-17:14779] [[65177,0],1] posting recv >> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 21 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-17:14779] [[65177,0],1] posting recv >> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 45 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-17:14779] [[65177,0],1] posting recv >> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 46 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-17:14779] [[65177,0],1] posting recv >> [node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 1 for peer >> [[WILDCARD],WILDCARD] >> [node1-128-17:14779] [[65177,0],1] OOB_SEND: rml_oob_send.c:199 >> [node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd >> [node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd >> [node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd >> [node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd >> [node1-128-17:14779] 
[[65177,0],1]:tcp:processing set_peer cmd >> [node1-128-17:14779] [[65177,0],1] oob:base:send to target [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1] oob:tcp:send_nb to peer [[65177,0],0]:10 >> [node1-128-17:14779] [[65177,0],1] tcp:send_nb to peer [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:484] post send to [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:421] processing send to peer >> [[65177,0],0]:10 >> [node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:455] queue pending to >> [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1] tcp:send_nb: initiating connection to >> [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:469] connect to [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1] orte_tcp_peer_try_connect: attempting to >> connect to proc [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1] orte_tcp_peer_try_connect: attempting to >> connect to proc [[65177,0],0] on socket 10 >> [node1-128-17:14779] [[65177,0],1] orte_tcp_peer_try_connect: attempting to >> connect to proc [[65177,0],0] on 10.0.251.53:49759 - 0 retries >> [node1-128-17:14779] [[65177,0],1] waiting for connect completion to >> [[65177,0],0] - activating send event >> Daemon [[65177,0],2] checking in as pid 17849 on host node1-128-18 >> [node1-128-18:17849] [[65177,0],2] orted: up and running - waiting for >> commands! >> [node1-128-18:17849] [[65177,0],2] posting recv >> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 30 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-18:17849] [[65177,0],2] posting recv >> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 15 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-18:17849] [[65177,0],2] posting recv >> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 32 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-18:17849] [[65177,0],2] posting recv >> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 11 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-18:17849] [[65177,0],2] posting recv >> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 9 for peer >> [[WILDCARD],WILDCARD] >> [node1-128-18:17849] [[65177,0],2]: set_addr to uri >> 4271439872.0;tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759 >> [node1-128-18:17849] [[65177,0],2]:set_addr checking if peer [[65177,0],0] >> is reachable via component tcp >> [node1-128-18:17849] [[65177,0],2] oob:tcp: working peer [[65177,0],0] >> address tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759 >> [node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.0.251.53 TO MODULE >> [node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.0.0.4 TO MODULE >> [node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.2.251.14 TO MODULE >> [node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.128.0.4 TO MODULE >> [node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2] PASSING ADDR 93.180.7.38 TO MODULE >> [node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2]: peer [[65177,0],0] is reachable via >> component tcp >> [node1-128-18:17849] [[65177,0],2] posting recv >> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 3 for peer >> [[WILDCARD],WILDCARD] >> [node1-128-18:17849] 
[[65177,0],2] posting recv >> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 21 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-18:17849] [[65177,0],2] posting recv >> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 45 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-18:17849] [[65177,0],2] posting recv >> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 46 for >> peer [[WILDCARD],WILDCARD] >> [node1-128-18:17849] [[65177,0],2] posting recv >> [node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 1 for peer >> [[WILDCARD],WILDCARD] >> [node1-128-18:17849] [[65177,0],2] OOB_SEND: rml_oob_send.c:199 >> [node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd >> [node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd >> [node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd >> [node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd >> [node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd >> [node1-128-18:17849] [[65177,0],2] oob:base:send to target [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2] oob:tcp:send_nb to peer [[65177,0],0]:10 >> [node1-128-18:17849] [[65177,0],2] tcp:send_nb to peer [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:484] post send to [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:421] processing send to peer >> [[65177,0],0]:10 >> [node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:455] queue pending to >> [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2] tcp:send_nb: initiating connection to >> [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:469] connect to [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2] orte_tcp_peer_try_connect: attempting to >> connect to proc [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2] orte_tcp_peer_try_connect: attempting to >> connect to proc [[65177,0],0] on socket 10 >> [node1-128-18:17849] [[65177,0],2] orte_tcp_peer_try_connect: attempting to >> connect to proc [[65177,0],0] on 10.0.251.53:49759 - 0 retries >> [node1-128-18:17849] [[65177,0],2] waiting for connect completion to >> [[65177,0],0] - activating send event >> [node1-128-18:17837] [[61806,0],2] tcp:send_handler called to send to peer >> [[61806,0],0] >> [node1-128-18:17837] [[61806,0],2] tcp:send_handler CONNECTING >> [node1-128-18:17837] [[61806,0],2]:tcp:complete_connect called for peer >> [[61806,0],0] on socket 10 >> [node1-128-18:17837] [[61806,0],2]-[[61806,0],0] tcp_peer_complete_connect: >> connection failed: Connection timed out (110) >> [node1-128-18:17837] [[61806,0],2] tcp_peer_close for [[61806,0],0] sd 10 >> state CONNECTING >> [node1-128-18:17837] [[61806,0],2] tcp:lost connection called for peer >> [[61806,0],0] >> [node1-128-18:17837] mca: base: close: component oob closed >> [node1-128-18:17837] mca: base: close: unloading component oob >> [node1-128-18:17837] [[61806,0],2] TCP SHUTDOWN >> [node1-128-18:17837] [[61806,0],2] RELEASING PEER OBJ [[61806,0],0] >> [node1-128-18:17837] [[61806,0],2] CLOSING SOCKET 10 >> [node1-128-18:17837] mca: base: close: component tcp closed >> [node1-128-18:17837] mca: base: close: unloading component tcp >> srun: error: node1-128-18: task 1: Exited with exit code 1 >> srun: Terminating job step 647191.1 >> [node1-128-17:14767] [[61806,0],1] tcp:send_handler called to send to peer >> [[61806,0],0] >> [node1-128-17:14767] [[61806,0],1] tcp:send_handler CONNECTING >> [node1-128-17:14767] [[61806,0],1]:tcp:complete_connect called for peer >> [[61806,0],0] on socket 10 >> 
[node1-128-17:14767] [[61806,0],1]-[[61806,0],0] tcp_peer_complete_connect: >> connection failed: Connection timed out (110) >> [node1-128-17:14767] [[61806,0],1] tcp_peer_close for [[61806,0],0] sd 10 >> state CONNECTING >> [node1-128-17:14767] [[61806,0],1] tcp:lost connection called for peer >> [[61806,0],0] >> [node1-128-17:14767] mca: base: close: component oob closed >> [node1-128-17:14767] mca: base: close: unloading component oob >> [node1-128-17:14767] [[61806,0],1] TCP SHUTDOWN >> [node1-128-17:14767] [[61806,0],1] RELEASING PEER OBJ [[61806,0],0] >> [node1-128-17:14767] [[61806,0],1] CLOSING SOCKET 10 >> [node1-128-17:14767] mca: base: close: component tcp closed >> [node1-128-17:14767] mca: base: close: unloading component tcp >> srun: error: node1-128-17: task 0: Exited with exit code 1 >> [node1-128-17:14779] [[65177,0],1] tcp:send_handler called to send to peer >> [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1] tcp:send_handler CONNECTING >> [node1-128-17:14779] [[65177,0],1]:tcp:complete_connect called for peer >> [[65177,0],0] on socket 10 >> [node1-128-17:14779] [[65177,0],1]-[[65177,0],0] tcp_peer_complete_connect: >> connection failed: Connection timed out (110) >> [node1-128-17:14779] [[65177,0],1] tcp_peer_close for [[65177,0],0] sd 10 >> state CONNECTING >> [node1-128-17:14779] [[65177,0],1] tcp:lost connection called for peer >> [[65177,0],0] >> [node1-128-17:14779] mca: base: close: component oob closed >> [node1-128-17:14779] mca: base: close: unloading component oob >> [node1-128-17:14779] [[65177,0],1] TCP SHUTDOWN >> [node1-128-17:14779] [[65177,0],1] RELEASING PEER OBJ [[65177,0],0] >> [node1-128-17:14779] [[65177,0],1] CLOSING SOCKET 10 >> [node1-128-17:14779] mca: base: close: component tcp closed >> [node1-128-17:14779] mca: base: close: unloading component tcp >> [node1-128-18:17849] [[65177,0],2] tcp:send_handler called to send to peer >> [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2] tcp:send_handler CONNECTING >> [node1-128-18:17849] [[65177,0],2]:tcp:complete_connect called for peer >> [[65177,0],0] on socket 10 >> [node1-128-18:17849] [[65177,0],2]-[[65177,0],0] tcp_peer_complete_connect: >> connection failed: Connection timed out (110) >> [node1-128-18:17849] [[65177,0],2] tcp_peer_close for [[65177,0],0] sd 10 >> state CONNECTING >> [node1-128-18:17849] [[65177,0],2] tcp:lost connection called for peer >> [[65177,0],0] >> [node1-128-18:17849] mca: base: close: component oob closed >> [node1-128-18:17849] mca: base: close: unloading component oob >> [node1-128-18:17849] [[65177,0],2] TCP SHUTDOWN >> [node1-128-18:17849] [[65177,0],2] RELEASING PEER OBJ [[65177,0],0] >> [node1-128-18:17849] [[65177,0],2] CLOSING SOCKET 10 >> [node1-128-18:17849] mca: base: close: component tcp closed >> [node1-128-18:17849] mca: base: close: unloading component tcp >> srun: error: node1-128-17: task 0: Exited with exit code 1 >> srun: Terminating job step 647191.2 >> srun: error: node1-128-18: task 1: Exited with exit code 1 >> -------------------------------------------------------------------------- >> An ORTE daemon has unexpectedly failed after launch and before >> communicating back to mpirun. This could be caused by a number >> of factors, including an inability to create a connection back >> to mpirun due to a lack of common network interfaces and/or no >> route found between them. Please check network connectivity >> (including firewalls and network routing requirements). 
>> --------------------------------------------------------------------------
>> [compiler-2:30735] [[65177,0],0] orted_cmd: received halt_vm cmd
>> [compiler-2:30735] mca: base: close: component oob closed
>> [compiler-2:30735] mca: base: close: unloading component oob
>> [compiler-2:30735] [[65177,0],0] TCP SHUTDOWN
>> [compiler-2:30735] mca: base: close: component tcp closed
>> [compiler-2:30735] mca: base: close: unloading component tcp
>>
>> Sun, 20 Jul 2014 13:11:19 -0700, from Ralph Castain <r...@open-mpi.org>:
>> Yeah, we aren't connecting back - is there a firewall running? You need to leave the "--debug-daemons --mca plm_base_verbose 5" on there as well to see the entire problem.
>>
>> What you can see here is that mpirun is listening on several interfaces:
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list of V4 connections
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list of V4 connections
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of V4 connections
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.128.0.1 to our list of V4 connections
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4 connections
>>
>> It looks like you have multiple interfaces connected to the same subnet - this is generally a bad idea. I also saw that the last one in the list shows up twice in the kernel array - not sure why, but is there something special about that NIC?
>>
>> What do the NICs look like on the remote hosts?
>>
>> On Jul 20, 2014, at 10:59 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>
>>> -------- Forwarded message --------
>>> From: Timur Ismagilov <tismagi...@mail.ru>
>>> To: Ralph Castain <r...@open-mpi.org>
>>> Date: Sun, 20 Jul 2014 21:58:41 +0400
>>> Subject: Re[2]: [OMPI users] Fwd: Re[4]: Salloc and mpirun problem
>>>
>>> Here it is:
>>>
>>> $ salloc -N2 --exclusive -p test -J ompi
>>> salloc: Granted job allocation 647049
>>>
>>> $ mpirun -mca mca_base_env_list 'LD_PRELOAD' -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c
>>>
>>> [access1:24264] mca: base: components_register: registering oob components
>>> [access1:24264] mca: base: components_register: found loaded component tcp
>>> [access1:24264] mca: base: components_register: component tcp register function successful
>>> [access1:24264] mca: base: components_open: opening oob components
>>> [access1:24264] mca: base: components_open: found loaded component tcp
>>> [access1:24264] mca: base: components_open: component tcp open function successful
>>> [access1:24264] mca:oob:select: checking available component tcp
>>> [access1:24264] mca:oob:select: Querying component [tcp]
>>> [access1:24264] oob:tcp: component_available called
>>> [access1:24264] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>> [access1:24264] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list of V4 connections
>>> [access1:24264] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of V4 connections
>>> [access1:24264] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list of V4 connections
>>> [access1:24264] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>> [access1:24264] [[55095,0],0]
oob:tcp:init adding 10.128.0.1 to our list of >>> V4 connections >>> [access1:24264] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4 >>> [access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list >>> of V4 connections >>> [access1:24264] WORKING INTERFACE 7 KERNEL INDEX 7 FAMILY: V4 >>> [access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list >>> of V4 connections >>> [access1:24264] [[55095,0],0] TCP STARTUP >>> [access1:24264] [[55095,0],0] attempting to bind to IPv4 port 0 >>> [access1:24264] [[55095,0],0] assigned IPv4 port 47756 >>> [access1:24264] mca:oob:select: Adding component to end >>> [access1:24264] mca:oob:select: Found 1 active transports >>> [access1:24264] mca: base: components_register: registering rml components >>> [access1:24264] mca: base: components_register: found loaded component oob >>> [access1:24264] mca: base: components_register: component oob has no >>> register or open function >>> [access1:24264] mca: base: components_open: opening rml components >>> [access1:24264] mca: base: components_open: found loaded component oob >>> [access1:24264] mca: base: components_open: component oob open function >>> successful >>> [access1:24264] orte_rml_base_select: initializing rml component oob >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 30 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 15 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 32 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 33 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 5 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 10 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 12 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 9 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 34 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 2 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 21 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 22 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 45 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 46 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting recv >>> [access1:24264] [[55095,0],0] posting persistent recv on tag 1 for peer >>> [[WILDCARD],WILDCARD] >>> [access1:24264] [[55095,0],0] posting 
recv
>>> [access1:24264] [[55095,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
>>> --------------------------------------------------------------------------
>>> An ORTE daemon has unexpectedly failed after launch and before
>>> communicating back to mpirun. This could be caused by a number
>>> of factors, including an inability to create a connection back
>>> to mpirun due to a lack of common network interfaces and/or no
>>> route found between them. Please check network connectivity
>>> (including firewalls and network routing requirements).
>>> --------------------------------------------------------------------------
>>> [access1:24264] mca: base: close: component oob closed
>>> [access1:24264] mca: base: close: unloading component oob
>>> [access1:24264] [[55095,0],0] TCP SHUTDOWN
>>> [access1:24264] mca: base: close: component tcp closed
>>> [access1:24264] mca: base: close: unloading component tcp
>>>
>>> When I use srun, I get:
>>>
>>> $ salloc -N2 --exclusive -p test -J ompi
>>> ....
>>> $ srun -N 2 ./hello_c
>>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI semenov@compiler-2 Distribution, ident: 1.9a1r32252, repo rev: r32252, Jul 16, 2014 (nightly snapshot tarball), 146)
>>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI semenov@compiler-2 Distribution, ident: 1.9a1r32252, repo rev: r32252, Jul 16, 2014 (nightly snapshot tarball), 146)
>>>
>>> Sun, 20 Jul 2014 09:28:13 -0700, from Ralph Castain <r...@open-mpi.org>:
>>>
>>> Try adding -mca oob_base_verbose 10 -mca rml_base_verbose 10 to your cmd line. It looks to me like we are unable to connect back to the node where you are running mpirun for some reason.
>>>
>>> On Jul 20, 2014, at 9:16 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>
>>>> I have the same problem with Open MPI 1.8.1 (Apr 23, 2014).
>>>> Does the srun command have a --map-by <foo> mpirun parameter, or can I change it from the bash environment?
>>>> -------- Forwarded message --------
>>>> From: Timur Ismagilov <tismagi...@mail.ru>
>>>> To: Mike Dubman <mi...@dev.mellanox.co.il>
>>>> Cc: Open MPI Users <us...@open-mpi.org>
>>>> Date: Thu, 17 Jul 2014 16:42:24 +0400
>>>> Subject: Re[4]: [OMPI users] Salloc and mpirun problem
>>>>
>>>> With Open MPI 1.9a1r32252 (Jul 16, 2014 (nightly snapshot tarball)) I got this output (the same?):
>>>>
>>>> $ salloc -N2 --exclusive -p test -J ompi
>>>> salloc: Granted job allocation 645686
>>>>
>>>> $ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so mpirun -mca mca_base_env_list 'LD_PRELOAD' --mca plm_base_verbose 10 --debug-daemons -np 1 hello_c
>>>>
>>>> [access1:04312] mca: base: components_register: registering plm components
>>>> [access1:04312] mca: base: components_register: found loaded component isolated
>>>> [access1:04312] mca: base: components_register: component isolated has no register or open function
>>>> [access1:04312] mca: base: components_register: found loaded component rsh
>>>> [access1:04312] mca: base: components_register: component rsh register function successful
>>>> [access1:04312] mca: base: components_register: found loaded component slurm
>>>> [access1:04312] mca: base: components_register: component slurm register function successful
>>>> [access1:04312] mca: base: components_open: opening plm components
>>>> [access1:04312] mca: base: components_open: found loaded component isolated
>>>> [access1:04312] mca: base: components_open: component isolated open function successful
>>>> [access1:04312] mca: base: components_open: found loaded component rsh
>>>> [access1:04312] mca: base: components_open: component rsh open function successful
>>>> [access1:04312] mca: base: components_open: found loaded component slurm
>>>> [access1:04312] mca: base: components_open: component slurm open function successful
>>>> [access1:04312] mca:base:select: Auto-selecting plm components
>>>> [access1:04312] mca:base:select:( plm) Querying component [isolated]
>>>> [access1:04312] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>> [access1:04312] mca:base:select:( plm) Querying component [rsh]
>>>> [access1:04312] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>> [access1:04312] mca:base:select:( plm) Querying component [slurm]
>>>> [access1:04312] mca:base:select:( plm) Query of component [slurm] set priority to 75
>>>> [access1:04312] mca:base:select:( plm) Selected component [slurm]
>>>> [access1:04312] mca: base: close: component isolated closed
>>>> [access1:04312] mca: base: close: unloading component isolated
>>>> [access1:04312] mca: base: close: component rsh closed
>>>> [access1:04312] mca: base: close: unloading component rsh
>>>> Daemon was launched on node1-128-09 - beginning to initialize
>>>> Daemon was launched on node1-128-15 - beginning to initialize
>>>> Daemon [[39207,0],1] checking in as pid 26240 on host node1-128-09
>>>> [node1-128-09:26240] [[39207,0],1] orted: up and running - waiting for commands!
>>>> Daemon [[39207,0],2] checking in as pid 30129 on host node1-128-15
>>>> [node1-128-15:30129] [[39207,0],2] orted: up and running - waiting for commands!
>>>> srun: error: node1-128-09: task 0: Exited with exit code 1 >>>> srun: Terminating job step 645686.3 >>>> srun: error: node1-128-15: task 1: Exited with exit code 1 >>>> -------------------------------------------------------------------------- >>>> An ORTE daemon has unexpectedly failed after launch and before >>>> communicating back to mpirun. This could be caused by a number >>>> of factors, including an inability to create a connection back >>>> to mpirun due to a lack of common network interfaces and/or no >>>> route found between them. Please check network connectivity >>>> (including firewalls and network routing requirements). >>>> -------------------------------------------------------------------------- >>>> [access1:04312] [[39207,0],0] orted_cmd: received halt_vm cmd >>>> [access1:04312] mca: base: close: component slurm closed >>>> [access1:04312] mca: base: close: unloading component slurm >>>> >>>> >>>> >>>> >>> >>> >>> >>> >>> _______________________________________________ >>> users mailing list >>> us...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users >>> Link to this post: >>> http://www.open-mpi.org/community/lists/users/2014/07/24828.php >> >> >> >> > > > >
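
For anyone reproducing this investigation, the steps scattered through the thread reduce to a short sequence: allocate the nodes, run with the daemon debugging and OOB/RML/PLM verbosity that Ralph requested so the connection attempts back to mpirun are visible, and, if the daemons time out on an unreachable address, pin the OOB to an interface that is routable from the compute nodes. A minimal sketch reusing only the commands and flag values shown above; the partition name, job name, and interface name are taken from Timur's examples and will differ on other sites.

# 1. Allocate two nodes under SLURM (partition "test", job name "ompi" as in the thread).
$ salloc -N2 --exclusive -p test -J ompi

# 2. Launch with daemon debugging and verbose launch/OOB/RML output to watch the call-back to mpirun.
$ mpirun --debug-daemons --mca plm_base_verbose 5 --mca oob_base_verbose 10 --mca rml_base_verbose 10 -np 2 hello_c

# 3. If the daemons report "connection failed: Connection timed out" as in the logs above,
#    restrict the out-of-band TCP traffic to the interface reachable from the compute nodes.
$ mpirun --mca oob_tcp_if_include ib0 -np 2 hello_c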