Just committed a potential fix to the trunk - please let me know if it worked for you
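For anyone following along: the "CONNECTION REQUEST ON UNKNOWN INTERFACE" failure in the quoted logs amounts to the listener failing to match an incoming peer's source address against any known local interface subnet. A minimal hypothetical sketch of that matching logic (an illustration only, not the actual Open MPI/ORTE code; the interface data comes from the ifconfig output quoted below):

```python
# Hypothetical sketch of subnet-matching behind a
# "CONNECTION REQUEST ON UNKNOWN INTERFACE" style check.
# NOT the actual Open MPI/ORTE implementation.
import ipaddress

def find_local_interface(src_ip, interfaces):
    """Return the name of the local interface whose subnet contains
    src_ip, or None if the source matches no known interface."""
    src = ipaddress.ip_address(src_ip)
    for name, subnet in interfaces.items():
        if src in ipaddress.ip_network(subnet):
            return name
    return None  # no match -> "unknown interface"

# tyr's only non-loopback interface (bge0, netmask ffffffe0 -> /27):
tyr_interfaces = {"bge0": "193.174.24.32/27"}

# sunpc1 connects from 193.174.26.210, which lies outside that subnet:
print(find_local_interface("193.174.26.210", tyr_interfaces))  # None
print(find_local_interface("193.174.24.39", tyr_interfaces))   # bge0
```

If the listener only accepts peers whose source address falls inside one of its own interface subnets, any connection arriving from a routed (off-subnet) peer would be rejected exactly as in the logs below.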
On May 14, 2014, at 11:44 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi Ralph,
>
>> Hmmm...well, that's an interesting naming scheme :-)
>>
>> Try adding "-mca oob_base_verbose 10 --report-uri -" on your cmd line and let's see what it thinks is happening
>
>
> tyr fd1026 105 mpiexec -np 3 --host tyr,sunpc1,linpc1 --mca oob_base_verbose 10 --report-uri - hostname
> [tyr.informatik.hs-fulda.de:06877] mca: base: components_register: registering oob components
> [tyr.informatik.hs-fulda.de:06877] mca: base: components_register: found loaded component tcp
> [tyr.informatik.hs-fulda.de:06877] mca: base: components_register: component tcp register function successful
> [tyr.informatik.hs-fulda.de:06877] mca: base: components_open: opening oob components
> [tyr.informatik.hs-fulda.de:06877] mca: base: components_open: found loaded component tcp
> [tyr.informatik.hs-fulda.de:06877] mca: base: components_open: component tcp open function successful
> [tyr.informatik.hs-fulda.de:06877] mca:oob:select: checking available component tcp
> [tyr.informatik.hs-fulda.de:06877] mca:oob:select: Querying component [tcp]
> [tyr.informatik.hs-fulda.de:06877] oob:tcp: component_available called
> [tyr.informatik.hs-fulda.de:06877] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [tyr.informatik.hs-fulda.de:06877] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:tcp:init creating module for V4 address on interface bge0
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] creating OOB-TCP module for interface bge0
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:tcp:init adding 193.174.24.39 to our list of V4 connections
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] TCP STARTUP
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] attempting to bind to IPv4 port 0
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] assigned IPv4 port 55567
> [tyr.informatik.hs-fulda.de:06877] mca:oob:select: Adding component to end
> [tyr.informatik.hs-fulda.de:06877] mca:oob:select: Found 1 active transports
> 3170566144.0;tcp://193.174.24.39:55567
> [sunpc1:07690] mca: base: components_register: registering oob components
> [sunpc1:07690] mca: base: components_register: found loaded component tcp
> [sunpc1:07690] mca: base: components_register: component tcp register function successful
> [sunpc1:07690] mca: base: components_open: opening oob components
> [sunpc1:07690] mca: base: components_open: found loaded component tcp
> [sunpc1:07690] mca: base: components_open: component tcp open function successful
> [sunpc1:07690] mca:oob:select: checking available component tcp
> [sunpc1:07690] mca:oob:select: Querying component [tcp]
> [sunpc1:07690] oob:tcp: component_available called
> [sunpc1:07690] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [sunpc1:07690] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
> [sunpc1:07690] [[48379,0],1] oob:tcp:init creating module for V4 address on interface nge0
> [sunpc1:07690] [[48379,0],1] creating OOB-TCP module for interface nge0
> [sunpc1:07690] [[48379,0],1] oob:tcp:init adding 193.174.26.210 to our list of V4 connections
> [sunpc1:07690] [[48379,0],1] TCP STARTUP
> [sunpc1:07690] [[48379,0],1] attempting to bind to IPv4 port 0
> [sunpc1:07690] [[48379,0],1] assigned IPv4 port 39616
> [sunpc1:07690] mca:oob:select: Adding component to end
> [sunpc1:07690] mca:oob:select: Found 1 active transports
> [sunpc1:07690] [[48379,0],1]: set_addr to uri 3170566144.0;tcp://193.174.24.39:55567
> [sunpc1:07690] [[48379,0],1]:set_addr checking if peer [[48379,0],0] is reachable via component tcp
> [sunpc1:07690] [[48379,0],1] oob:tcp: working peer [[48379,0],0] address tcp://193.174.24.39:55567
> [sunpc1:07690] [[48379,0],1] UNFOUND KERNEL INDEX -13 FOR ADDRESS 193.174.24.39
> [sunpc1:07690] [[48379,0],1] PEER [[48379,0],0] MAY BE REACHABLE BY ROUTING - ASSIGNING MODULE AT KINDEX 2 INTERFACE nge0
> [sunpc1:07690] [[48379,0],1] PASSING ADDR 193.174.24.39 TO INTERFACE nge0 AT KERNEL INDEX 2
> [sunpc1:07690] [[48379,0],1]:tcp set addr for peer [[48379,0],0]
> [sunpc1:07690] [[48379,0],1]: peer [[48379,0],0] is reachable via component tcp
> [sunpc1:07690] [[48379,0],1] OOB_SEND: ../../../../../openmpi-1.8.2a1r31742/orte/mca/rml/oob/rml_oob_send.c:199
> [sunpc1:07690] [[48379,0],1]:tcp:processing set_peer cmd for interface nge0
> [sunpc1:07690] [[48379,0],1] oob:base:send to target [[48379,0],0]
> [sunpc1:07690] [[48379,0],1] oob:tcp:send_nb to peer [[48379,0],0]:10
> [sunpc1:07690] [[48379,0],1] tcp:send_nb to peer [[48379,0],0]
> [sunpc1:07690] [[48379,0],1]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:508] post send to [[48379,0],0]
> [sunpc1:07690] [[48379,0],1]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:442] processing send to peer [[48379,0],0]:10
> [sunpc1:07690] [[48379,0],1]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:476] queue pending to [[48379,0],0]
> [sunpc1:07690] [[48379,0],1] tcp:send_nb: initiating connection to [[48379,0],0]
> [sunpc1:07690] [[48379,0],1]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:490] connect to [[48379,0],0]
> [sunpc1:07690] [[48379,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] via interface nge0
> [sunpc1:07690] [[48379,0],1] oob:tcp:peer creating socket to [[48379,0],0]
> [sunpc1:07690] [[48379,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] via interface nge0 on socket 10
> [sunpc1:07690] [[48379,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] on 193.174.24.39:55567 - 0 retries
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] mca_oob_tcp_listen_thread: new connection: (15, 0) 193.174.26.210:39617
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] connection_handler: working connection (15, 11) 193.174.26.210:39617
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] CONNECTION REQUEST ON UNKNOWN INTERFACE
> [sunpc1:07690] [[48379,0],1] waiting for connect completion to [[48379,0],0] - activating send event
> [sunpc1:07690] [[48379,0],1] tcp:send_handler called to send to peer [[48379,0],0]
> [sunpc1:07690] [[48379,0],1] tcp:send_handler CONNECTING
> [sunpc1:07690] [[48379,0],1]:tcp:complete_connect called for peer [[48379,0],0] on socket 10
> [sunpc1:07690] [[48379,0],1] tcp_peer_complete_connect: sending ack to [[48379,0],0]
> [sunpc1:07690] [[48379,0],1] SEND CONNECT ACK
> [sunpc1:07690] [[48379,0],1] send blocking of 48 bytes to socket 10
> [sunpc1:07690] [[48379,0],1] connect-ack sent to socket 10
> [sunpc1:07690] [[48379,0],1] tcp_peer_complete_connect: setting read event on connection to [[48379,0],0]
> [linpc1:21511] mca: base: components_register: registering oob components
> [linpc1:21511] mca: base: components_register: found loaded component tcp
> [linpc1:21511] mca: base: components_register: component tcp register function successful
> [linpc1:21511] mca: base: components_open: opening oob components
> [linpc1:21511] mca: base: components_open: found loaded component tcp
> [linpc1:21511] mca: base: components_open: component tcp open function successful
> [linpc1:21511] mca:oob:select: checking available component tcp
> [linpc1:21511] mca:oob:select: Querying component [tcp]
> [linpc1:21511] oob:tcp: component_available called
>
> [linpc1:21511] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [linpc1:21511] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
> [linpc1:21511] [[48379,0],2] oob:tcp:init creating module for V4 address on interface eth0
> [linpc1:21511] [[48379,0],2] creating OOB-TCP module for interface eth0
> [linpc1:21511] [[48379,0],2] oob:tcp:init adding 193.174.26.208 to our list of V4 connections
> [linpc1:21511] [[48379,0],2] TCP STARTUP
> [linpc1:21511] [[48379,0],2] attempting to bind to IPv4 port 0
> [linpc1:21511] [[48379,0],2] assigned IPv4 port 39724
> [linpc1:21511] mca:oob:select: Adding component to end
> [linpc1:21511] mca:oob:select: Found 1 active transports
> [linpc1:21511] [[48379,0],2]: set_addr to uri 3170566144.0;tcp://193.174.24.39:55567
> [linpc1:21511] [[48379,0],2]:set_addr checking if peer [[48379,0],0] is reachable via component tcp
> [linpc1:21511] [[48379,0],2] oob:tcp: working peer [[48379,0],0] address tcp://193.174.24.39:55567
> [linpc1:21511] [[48379,0],2] UNFOUND KERNEL INDEX -13 FOR ADDRESS 193.174.24.39
> [linpc1:21511] [[48379,0],2] PEER [[48379,0],0] MAY BE REACHABLE BY ROUTING - ASSIGNING MODULE AT KINDEX 2 INTERFACE eth0
> [linpc1:21511] [[48379,0],2] PASSING ADDR 193.174.24.39 TO INTERFACE eth0 AT KERNEL INDEX 2
> [linpc1:21511] [[48379,0],2]:tcp set addr for peer [[48379,0],0]
> [linpc1:21511] [[48379,0],2]: peer [[48379,0],0] is reachable via component tcp
> [linpc1:21511] [[48379,0],2] OOB_SEND: ../../../../../openmpi-1.8.2a1r31742/orte/mca/rml/oob/rml_oob_send.c:199
> [linpc1:21511] [[48379,0],2]:tcp:processing set_peer cmd for interface eth0
> [linpc1:21511] [[48379,0],2] oob:base:send to target [[48379,0],0]
> [linpc1:21511] [[48379,0],2] oob:tcp:send_nb to peer [[48379,0],0]:10
> [linpc1:21511] [[48379,0],2] tcp:send_nb to peer [[48379,0],0]
> [linpc1:21511] [[48379,0],2]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:508] post send to [[48379,0],0]
> [linpc1:21511] [[48379,0],2]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:442] processing send to peer [[48379,0],0]:10
> [linpc1:21511] [[48379,0],2]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:476] queue pending to [[48379,0],0]
> [linpc1:21511] [[48379,0],2] tcp:send_nb: initiating connection to [[48379,0],0]
> [linpc1:21511] [[48379,0],2]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:490] connect to [[48379,0],0]
> [linpc1:21511] [[48379,0],2] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] via interface eth0
> [linpc1:21511] [[48379,0],2] oob:tcp:peer creating socket to [[48379,0],0]
> [linpc1:21511] [[48379,0],2] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] via interface eth0 on socket 9
> [linpc1:21511] [[48379,0],2] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] on 193.174.24.39:55567 - 0 retries
> [linpc1:21511] [[48379,0],2] waiting for connect completion to [[48379,0],0] - activating send event
> [linpc1:21511] [[48379,0],2] tcp:send_handler called to send to peer [[48379,0],0]
> [linpc1:21511] [[48379,0],2] tcp:send_handler CONNECTING
> [linpc1:21511] [[48379,0],2]:tcp:complete_connect called for peer [[48379,0],0] on socket 9
> [linpc1:21511] [[48379,0],2] tcp_peer_complete_connect: sending ack to [[48379,0],0]
> [linpc1:21511] [[48379,0],2] SEND CONNECT ACK
> [linpc1:21511] [[48379,0],2] send blocking of 48 bytes to socket 9
> [linpc1:21511] [[48379,0],2] connect-ack sent to socket 9
> [linpc1:21511] [[48379,0],2] tcp_peer_complete_connect: setting read event on connection to [[48379,0],0]
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] mca_oob_tcp_listen_thread: new connection: (16, 11) 193.174.26.208:53741
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] connection_handler: working connection (16, 11) 193.174.26.208:53741
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] CONNECTION REQUEST ON UNKNOWN INTERFACE
> ^CKilled by signal 2.
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] OOB_SEND: ../../../../../openmpi-1.8.2a1r31742/orte/mca/rml/oob/rml_oob_send.c:199
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] OOB_SEND: ../../../../../openmpi-1.8.2a1r31742/orte/mca/rml/oob/rml_oob_send.c:199
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:base:send to target [[48379,0],1]
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:base:send unknown peer [[48379,0],1]
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] is NOT reachable by TCP
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:base:send to target [[48379,0],2]
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:base:send unknown peer [[48379,0],2]
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] is NOT reachable by TCP
> Killed by signal 2.
> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] TCP SHUTDOWN
> [tyr.informatik.hs-fulda.de:06877] mca: base: close: component tcp closed
> [tyr.informatik.hs-fulda.de:06877] mca: base: close: unloading component tcp
> tyr fd1026 106
>
>
> Thank you very much for your help in advance. Do you need anything else?
>
>
> Kind regards
>
> Siegmar
>
>
>
>> On May 14, 2014, at 9:06 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>
>>> Hi Ralph,
>>>
>>>> What are the interfaces on these machines?
>>>
>>> tyr fd1026 111 ifconfig -a
>>> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>>>         inet 127.0.0.1 netmask ff000000
>>> bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
>>>         inet 193.174.24.39 netmask ffffffe0 broadcast 193.174.24.63
>>> tyr fd1026 112
>>>
>>>
>>> tyr fd1026 112 ssh sunpc1 ifconfig -a
>>> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>>>         inet 127.0.0.1 netmask ff000000
>>> nge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
>>>         inet 193.174.26.210 netmask ffffffc0 broadcast 193.174.26.255
>>> tyr fd1026 113
>>>
>>>
>>> tyr fd1026 113 ssh linpc1 /sbin/ifconfig -a
>>> eth0      Link encap:Ethernet  HWaddr 00:14:4F:23:FD:A8
>>>           inet addr:193.174.26.208  Bcast:193.174.26.255  Mask:255.255.255.192
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:18052524 errors:127 dropped:0 overruns:0 frame:127
>>>           TX packets:15917888 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:4158294157 (3965.6 Mb)  TX bytes:12060556809 (11501.8 Mb)
>>>           Interrupt:23 Base address:0x4000
>>>
>>> eth1      Link encap:Ethernet  HWaddr 00:14:4F:23:FD:A9
>>>           BROADCAST MULTICAST  MTU:1500  Metric:1
>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>           Interrupt:45 Base address:0xa000
>>>
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>           RX packets:1083 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:1083 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:0
>>>           RX bytes:329323 (321.6 Kb)  TX bytes:329323 (321.6 Kb)
>>>
>>> tyr fd1026 114
>>>
>>>
>>> Do you need something else?
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>>
>>>
>>>
>>>> On May 14, 2014, at 7:45 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I just installed openmpi-1.8.2a1r31742 on my machines (Solaris 10
>>>>> Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with
>>>>> Sun C 5.12 and still have the following problem.
>>>>>
>>>>> tyr fd1026 102 which mpiexec
>>>>> /usr/local/openmpi-1.8.2_64_cc/bin/mpiexec
>>>>> tyr fd1026 103 mpiexec -np 3 --host tyr,sunpc1,linpc1 hostname
>>>>> [tyr.informatik.hs-fulda.de:12827] [[37949,0],0] CONNECTION REQUEST ON UNKNOWN INTERFACE
>>>>> [tyr.informatik.hs-fulda.de:12827] [[37949,0],0] CONNECTION REQUEST ON UNKNOWN INTERFACE
>>>>> ^CKilled by signal 2.
>>>>> Killed by signal 2.
>>>>> tyr fd1026 104
>>>>>
>>>>>
>>>>> The command works fine with openmpi-1.6.6rc1.
>>>>>
>>>>> tyr fd1026 102 which mpiexec
>>>>> /usr/local/openmpi-1.6.6_64_cc/bin/mpiexec
>>>>> tyr fd1026 103 mpiexec -np 3 --host tyr,sunpc1,linpc1 hostname
>>>>> tyr.informatik.hs-fulda.de
>>>>> linpc1
>>>>> sunpc1
>>>>> tyr fd1026 104
>>>>>
>>>>>
>>>>> I have reported the problem before and I would be grateful if
>>>>> somebody could solve it. Please let me know if I can provide any
>>>>> other information.
>>>>>
>>>>>
>>>>> Kind regards
>>>>>
>>>>> Siegmar
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
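For reference, the addresses and netmasks from the ifconfig output quoted in this thread can be checked for on-link reachability with a short sketch (Python used purely as illustration; the hex masks ffffffe0 and ffffffc0 correspond to /27 and /26):

```python
# Check whether the three hosts from the ifconfig output share a subnet.
# Addresses/netmasks taken from the thread above:
#   tyr    bge0: 193.174.24.39  netmask ffffffe0 (/27)
#   sunpc1 nge0: 193.174.26.210 netmask ffffffc0 (/26)
#   linpc1 eth0: 193.174.26.208 netmask ffffffc0 (/26)
import ipaddress

tyr    = ipaddress.ip_interface("193.174.24.39/27")
sunpc1 = ipaddress.ip_interface("193.174.26.210/26")
linpc1 = ipaddress.ip_interface("193.174.26.208/26")

for name, peer in (("sunpc1", sunpc1), ("linpc1", linpc1)):
    # tyr sits on 193.174.24.32/27 while both peers sit on
    # 193.174.26.192/26, so traffic between tyr and the peers
    # must be routed rather than delivered on-link.
    print(name, "on-link with tyr:", tyr.ip in peer.network)  # False
```

This matches the debug output: both sunpc1 and linpc1 report "UNFOUND KERNEL INDEX -13" for tyr's address and fall back to "MAY BE REACHABLE BY ROUTING", and tyr then rejects the resulting off-subnet connection requests.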