This bug should be fixed in tonight's tarball, BTW.
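A note for anyone reading the backtrace below: the crash is in libevent's teardown path (opal_event_base_free calling event_del, which dies in event_queue_remove). That is the shape you typically see when an event's storage is released while the event is still registered with the event base. Purely as an illustration of that failure mode, and not a claim about which ORTE object is actually at fault, here is a minimal standalone sketch against plain libevent 2.0 rather than the opal_ wrappers (the names base, ev, and cb are just placeholders for this example):

    #include <stdlib.h>
    #include <sys/time.h>
    #include <event2/event.h>
    #include <event2/event_struct.h>   /* struct event definition, needed for event_assign() */

    static void cb(evutil_socket_t fd, short what, void *arg)
    {
        (void)fd; (void)what; (void)arg;   /* never fires in this sketch */
    }

    int main(void)
    {
        struct event_base *base = event_base_new();

        /* An event embedded in a heap-allocated object, registered with a long timeout. */
        struct event *ev = malloc(sizeof(*ev));
        struct timeval tv = { 60, 0 };
        event_assign(ev, base, -1, EV_PERSIST, cb, NULL);
        event_add(ev, &tv);

        /* The owning object is released without calling event_del() first... */
        free(ev);

        /* ...so event_base_free() still finds the dangling event in its queues
         * and calls event_del() on it, i.e. the same call chain as in the trace
         * below (event_base_free -> event_del -> event_queue_remove).  Whether
         * this actually segfaults depends on the allocator and platform. */
        event_base_free(base);
        return 0;
    }

The real bug is presumably in ORTE's shutdown ordering, which is what the fix mentioned above addresses.
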
On May 15, 2014, at 9:19 AM, Ralph Castain <r...@open-mpi.org> wrote:

> It is an unrelated bug introduced by a different commit - causing mpirun to
> segfault upon termination. The fact that you got the hostname to run
> indicates that this original fix works, so at least we know the connection
> logic is now okay.
>
> Thanks
> Ralph
>
>
> On May 15, 2014, at 3:40 AM, Siegmar Gross
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
>> Hi Ralph,
>>
>>> Just committed a potential fix to the trunk - please let me know
>>> if it worked for you
>>
>> Now I get the hostnames but also a segmentation fault.
>>
>> tyr fd1026 101 which mpiexec
>> /usr/local/openmpi-1.9_64_cc/bin/mpiexec
>> tyr fd1026 102 mpiexec -np 3 --host tyr,sunpc1,linpc1 hostname
>> tyr.informatik.hs-fulda.de
>> linpc1
>> sunpc1
>> [tyr:22835] *** Process received signal ***
>> [tyr:22835] Signal: Segmentation Fault (11)
>> [tyr:22835] Signal code: Address not mapped (1)
>> [tyr:22835] Failing at address: ffffffff7bf16de0
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:opal_backtrace_print+0x1c
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:0x183960
>> /lib/sparcv9/libc.so.1:0xd8b98
>> /lib/sparcv9/libc.so.1:0xcc70c
>> /lib/sparcv9/libc.so.1:0xcc918
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:0x1ce0e8
>> [ Signal 2125151224 (?)]
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:0x1ccde4
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:opal_libevent2021_event_del+0x88
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:opal_libevent2021_event_base_free+0x154
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:0x1bb9e8
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:mca_base_framework_close+0x1a0
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/lib64/libopen-pal.so.0.0.0:opal_finalize+0xcc
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/lib64/libopen-rte.so.0.0.0:orte_finalize+0x168
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/bin/orterun:orterun+0x23e0
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/bin/orterun:main+0x24
>> /export2/prog/SunOS_sparc/openmpi-1.9_64_cc/bin/orterun:_start+0x12c
>> [tyr:22835] *** End of error message ***
>> Segmentation fault
>> tyr fd1026 103 ompi_info | grep "revision:"
>> Open MPI repo revision: r31769
>> Open RTE repo revision: r31769
>> OPAL repo revision: r31769
>> tyr fd1026 104
>>
>> I get the following output in "dbx".
>>
>> tyr fd1026 104 /opt/solstudio12.3/bin/sparcv9/dbx /usr/local/openmpi-1.9_64_cc/bin/mpiexec
>> For information about new features see `help changes'
>> To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
>> Reading mpiexec
>> Reading ld.so.1
>> Reading libopen-rte.so.0.0.0
>> Reading libopen-pal.so.0.0.0
>> Reading libsendfile.so.1
>> Reading libpicl.so.1
>> Reading libkstat.so.1
>> Reading liblgrp.so.1
>> Reading libsocket.so.1
>> Reading libnsl.so.1
>> Reading librt.so.1
>> Reading libm.so.2
>> Reading libthread.so.1
>> Reading libc.so.1
>> Reading libdoor.so.1
>> Reading libaio.so.1
>> Reading libmd.so.1
>> (dbx) run -np 3 --host tyr,sunpc1,linpc1 hostname
>> Running: mpiexec -np 3 --host tyr,sunpc1,linpc1 hostname
>> (process id 23328)
>> Reading libc_psr.so.1
>> Reading mca_shmem_mmap.so
>> Reading libmp.so.2
>> Reading libscf.so.1
>> Reading libuutil.so.1
>> Reading libgen.so.1
>> Reading mca_shmem_posix.so
>> Reading mca_shmem_sysv.so
>> Reading mca_sec_basic.so
>> Reading mca_ess_env.so
>> Reading mca_ess_hnp.so
>> Reading mca_ess_singleton.so
>> Reading mca_ess_tool.so
>> Reading mca_pstat_test.so
>> Reading mca_state_app.so
>> Reading mca_state_hnp.so
>> Reading mca_state_novm.so
>> Reading mca_state_orted.so
>> Reading mca_state_staged_hnp.so
>> Reading mca_state_staged_orted.so
>> Reading mca_state_tool.so
>> Reading mca_errmgr_default_app.so
>> Reading mca_errmgr_default_hnp.so
>> Reading mca_errmgr_default_orted.so
>> Reading mca_errmgr_default_tool.so
>> Reading mca_plm_isolated.so
>> Reading mca_plm_rsh.so
>> Reading mca_oob_tcp.so
>> Reading mca_rml_oob.so
>> Reading mca_routed_binomial.so
>> Reading mca_routed_debruijn.so
>> Reading mca_routed_direct.so
>> Reading mca_routed_radix.so
>> Reading mca_dstore_hash.so
>> Reading mca_grpcomm_bad.so
>> Reading mca_ras_simulator.so
>> Reading mca_rmaps_lama.so
>> Reading mca_rmaps_mindist.so
>> Reading mca_rmaps_ppr.so
>> Reading mca_rmaps_rank_file.so
>> Reading mca_rmaps_resilient.so
>> Reading mca_rmaps_round_robin.so
>> Reading mca_rmaps_seq.so
>> Reading mca_rmaps_staged.so
>> Reading mca_odls_default.so
>> Reading mca_rtc_hwloc.so
>> Reading mca_iof_hnp.so
>> Reading mca_iof_mr_hnp.so
>> Reading mca_iof_mr_orted.so
>> Reading mca_iof_orted.so
>> Reading mca_iof_tool.so
>> Reading mca_filem_raw.so
>> Reading mca_dfs_app.so
>> Reading mca_dfs_orted.so
>> Reading mca_dfs_test.so
>> tyr.informatik.hs-fulda.de
>> linpc1
>> sunpc1
>> t@1 (l@1) signal SEGV (no mapping at the fault address) in event_queue_remove at 0xffffffff7e9ce0e8
>> 0xffffffff7e9ce0e8: event_queue_remove+0x01a8:  stx  %l0, [%l3 + 24]
>> Current function is opal_event_base_close
>>    62       opal_event_base_free (opal_event_base);
>>
>> (dbx) check -all
>> dbx: warning: check -all will be turned on in the next run of the process
>> access checking - OFF
>> memuse checking - OFF
>>
>> (dbx) run -np 3 --host tyr,sunpc1,linpc1 hostname
>> Running: mpiexec -np 3 --host tyr,sunpc1,linpc1 hostname
>> (process id 23337)
>> Reading rtcapihook.so
>> Reading libdl.so.1
>> Reading rtcaudit.so
>> Reading libmapmalloc.so.1
>> Reading rtcboot.so
>> Reading librtc.so
>> Reading libmd_psr.so.1
>> RTC: Enabling Error Checking...
>> RTC: Using UltraSparc trap mechanism
>> RTC: See `help rtc showmap' and `help rtc limitations' for details.
>> RTC: Running program...
>> Write to unallocated (wua) on thread 1:
>> Attempting to write 1 byte at address 0xffffffff79f04000
>> t@1 (l@1) stopped in _readdir at 0xffffffff56574da0
>> 0xffffffff56574da0: _readdir+0x0064:  call  _PROCEDURE_LINKAGE_TABLE_+0x2380 [PLT] ! 0xffffffff56742a80
>> Current function is find_dyn_components
>>   393       if (0 != lt_dlforeachfile(dir, save_filename, NULL)) {
>> (dbx)
>>
>> Do you need anything else?
>>
>> Kind regards
>>
>> Siegmar
>>
>>
>> On May 14, 2014, at 11:44 AM, Siegmar Gross
>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>
>>>> Hi Ralph,
>>>>
>>>>> Hmmm...well, that's an interesting naming scheme :-)
>>>>>
>>>>> Try adding "-mca oob_base_verbose 10 --report-uri -" on your cmd line
>>>>> and let's see what it thinks is happening
>>>>
>>>> tyr fd1026 105 mpiexec -np 3 --host tyr,sunpc1,linpc1 --mca oob_base_verbose 10 --report-uri - hostname
>>>> [tyr.informatik.hs-fulda.de:06877] mca: base: components_register: registering oob components
>>>> [tyr.informatik.hs-fulda.de:06877] mca: base: components_register: found loaded component tcp
>>>> [tyr.informatik.hs-fulda.de:06877] mca: base: components_register: component tcp register function successful
>>>> [tyr.informatik.hs-fulda.de:06877] mca: base: components_open: opening oob components
>>>> [tyr.informatik.hs-fulda.de:06877] mca: base: components_open: found loaded component tcp
>>>> [tyr.informatik.hs-fulda.de:06877] mca: base: components_open: component tcp open function successful
>>>> [tyr.informatik.hs-fulda.de:06877] mca:oob:select: checking available component tcp
>>>> [tyr.informatik.hs-fulda.de:06877] mca:oob:select: Querying component [tcp]
>>>> [tyr.informatik.hs-fulda.de:06877] oob:tcp: component_available called
>>>> [tyr.informatik.hs-fulda.de:06877] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>> [tyr.informatik.hs-fulda.de:06877] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:tcp:init creating module for V4 address on interface bge0
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] creating OOB-TCP module for interface bge0
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:tcp:init adding 193.174.24.39 to our list of V4 connections
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] TCP STARTUP
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] attempting to bind to IPv4 port 0
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] assigned IPv4 port 55567
>>>> [tyr.informatik.hs-fulda.de:06877] mca:oob:select: Adding component to end
>>>> [tyr.informatik.hs-fulda.de:06877] mca:oob:select: Found 1 active transports
>>>> 3170566144.0;tcp://193.174.24.39:55567
>>>> [sunpc1:07690] mca: base: components_register: registering oob components
>>>> [sunpc1:07690] mca: base: components_register: found loaded component tcp
>>>> [sunpc1:07690] mca: base: components_register: component tcp register function successful
>>>> [sunpc1:07690] mca: base: components_open: opening oob components
>>>> [sunpc1:07690] mca: base: components_open: found loaded component tcp
>>>> [sunpc1:07690] mca: base: components_open: component tcp open function successful
>>>> [sunpc1:07690] mca:oob:select: checking available component tcp
>>>> [sunpc1:07690] mca:oob:select: Querying component [tcp]
>>>> [sunpc1:07690] oob:tcp: component_available called
>>>> [sunpc1:07690] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>> [sunpc1:07690] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
>>>> [sunpc1:07690] [[48379,0],1] oob:tcp:init creating module for V4 address on interface nge0
>>>> [sunpc1:07690] [[48379,0],1] creating OOB-TCP module for interface nge0
>>>> [sunpc1:07690] [[48379,0],1] oob:tcp:init adding 193.174.26.210 to our list of V4 connections
>>>> [sunpc1:07690] [[48379,0],1] TCP STARTUP
>>>> [sunpc1:07690] [[48379,0],1] attempting to bind to IPv4 port 0
>>>> [sunpc1:07690] [[48379,0],1] assigned IPv4 port 39616
>>>> [sunpc1:07690] mca:oob:select: Adding component to end
>>>> [sunpc1:07690] mca:oob:select: Found 1 active transports
>>>> [sunpc1:07690] [[48379,0],1]: set_addr to uri 3170566144.0;tcp://193.174.24.39:55567
>>>> [sunpc1:07690] [[48379,0],1]:set_addr checking if peer [[48379,0],0] is reachable via component tcp
>>>> [sunpc1:07690] [[48379,0],1] oob:tcp: working peer [[48379,0],0] address tcp://193.174.24.39:55567
>>>> [sunpc1:07690] [[48379,0],1] UNFOUND KERNEL INDEX -13 FOR ADDRESS 193.174.24.39
>>>> [sunpc1:07690] [[48379,0],1] PEER [[48379,0],0] MAY BE REACHABLE BY ROUTING - ASSIGNING MODULE AT KINDEX 2 INTERFACE nge0
>>>> [sunpc1:07690] [[48379,0],1] PASSING ADDR 193.174.24.39 TO INTERFACE nge0 AT KERNEL INDEX 2
>>>> [sunpc1:07690] [[48379,0],1]:tcp set addr for peer [[48379,0],0]
>>>> [sunpc1:07690] [[48379,0],1]: peer [[48379,0],0] is reachable via component tcp
>>>> [sunpc1:07690] [[48379,0],1] OOB_SEND: ../../../../../openmpi-1.8.2a1r31742/orte/mca/rml/oob/rml_oob_send.c:199
>>>> [sunpc1:07690] [[48379,0],1]:tcp:processing set_peer cmd for interface nge0
>>>> [sunpc1:07690] [[48379,0],1] oob:base:send to target [[48379,0],0]
>>>> [sunpc1:07690] [[48379,0],1] oob:tcp:send_nb to peer [[48379,0],0]:10
>>>> [sunpc1:07690] [[48379,0],1] tcp:send_nb to peer [[48379,0],0]
>>>> [sunpc1:07690] [[48379,0],1]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:508] post send to [[48379,0],0]
>>>> [sunpc1:07690] [[48379,0],1]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:442] processing send to peer [[48379,0],0]:10
>>>> [sunpc1:07690] [[48379,0],1]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:476] queue pending to [[48379,0],0]
>>>> [sunpc1:07690] [[48379,0],1] tcp:send_nb: initiating connection to [[48379,0],0]
>>>> [sunpc1:07690] [[48379,0],1]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:490] connect to [[48379,0],0]
>>>> [sunpc1:07690] [[48379,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] via interface nge0
>>>> [sunpc1:07690] [[48379,0],1] oob:tcp:peer creating socket to [[48379,0],0]
>>>> [sunpc1:07690] [[48379,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] via interface nge0 on socket 10
>>>> [sunpc1:07690] [[48379,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] on 193.174.24.39:55567 - 0 retries
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] mca_oob_tcp_listen_thread: new connection: (15, 0) 193.174.26.210:39617
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] connection_handler: working connection (15, 11) 193.174.26.210:39617
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] CONNECTION REQUEST ON UNKNOWN INTERFACE
>>>> [sunpc1:07690] [[48379,0],1] waiting for connect completion to [[48379,0],0] - activating send event
>>>> [sunpc1:07690] [[48379,0],1] tcp:send_handler called to send to peer [[48379,0],0]
>>>> [sunpc1:07690] [[48379,0],1] tcp:send_handler CONNECTING
>>>> [sunpc1:07690] [[48379,0],1]:tcp:complete_connect called for peer [[48379,0],0] on socket 10
>>>> [sunpc1:07690] [[48379,0],1] tcp_peer_complete_connect: sending ack to [[48379,0],0]
>>>> [sunpc1:07690] [[48379,0],1] SEND CONNECT ACK
>>>> [sunpc1:07690] [[48379,0],1] send blocking of 48 bytes to socket 10
>>>> [sunpc1:07690] [[48379,0],1] connect-ack sent to socket 10
>>>> [sunpc1:07690] [[48379,0],1] tcp_peer_complete_connect: setting read event on connection to [[48379,0],0]
>>>> [linpc1:21511] mca: base: components_register: registering oob components
>>>> [linpc1:21511] mca: base: components_register: found loaded component tcp
>>>> [linpc1:21511] mca: base: components_register: component tcp register function successful
>>>> [linpc1:21511] mca: base: components_open: opening oob components
>>>> [linpc1:21511] mca: base: components_open: found loaded component tcp
>>>> [linpc1:21511] mca: base: components_open: component tcp open function successful
>>>> [linpc1:21511] mca:oob:select: checking available component tcp
>>>> [linpc1:21511] mca:oob:select: Querying component [tcp]
>>>> [linpc1:21511] oob:tcp: component_available called
>>>> [linpc1:21511] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>> [linpc1:21511] WORKING INTERFACE 2 KERNEL INDEX 2 FAMILY: V4
>>>> [linpc1:21511] [[48379,0],2] oob:tcp:init creating module for V4 address on interface eth0
>>>> [linpc1:21511] [[48379,0],2] creating OOB-TCP module for interface eth0
>>>> [linpc1:21511] [[48379,0],2] oob:tcp:init adding 193.174.26.208 to our list of V4 connections
>>>> [linpc1:21511] [[48379,0],2] TCP STARTUP
>>>> [linpc1:21511] [[48379,0],2] attempting to bind to IPv4 port 0
>>>> [linpc1:21511] [[48379,0],2] assigned IPv4 port 39724
>>>> [linpc1:21511] mca:oob:select: Adding component to end
>>>> [linpc1:21511] mca:oob:select: Found 1 active transports
>>>> [linpc1:21511] [[48379,0],2]: set_addr to uri 3170566144.0;tcp://193.174.24.39:55567
>>>> [linpc1:21511] [[48379,0],2]:set_addr checking if peer [[48379,0],0] is reachable via component tcp
>>>> [linpc1:21511] [[48379,0],2] oob:tcp: working peer [[48379,0],0] address tcp://193.174.24.39:55567
>>>> [linpc1:21511] [[48379,0],2] UNFOUND KERNEL INDEX -13 FOR ADDRESS 193.174.24.39
>>>> [linpc1:21511] [[48379,0],2] PEER [[48379,0],0] MAY BE REACHABLE BY ROUTING - ASSIGNING MODULE AT KINDEX 2 INTERFACE eth0
>>>> [linpc1:21511] [[48379,0],2] PASSING ADDR 193.174.24.39 TO INTERFACE eth0 AT KERNEL INDEX 2
>>>> [linpc1:21511] [[48379,0],2]:tcp set addr for peer [[48379,0],0]
>>>> [linpc1:21511] [[48379,0],2]: peer [[48379,0],0] is reachable via component tcp
>>>> [linpc1:21511] [[48379,0],2] OOB_SEND: ../../../../../openmpi-1.8.2a1r31742/orte/mca/rml/oob/rml_oob_send.c:199
>>>> [linpc1:21511] [[48379,0],2]:tcp:processing set_peer cmd for interface eth0
>>>> [linpc1:21511] [[48379,0],2] oob:base:send to target [[48379,0],0]
>>>> [linpc1:21511] [[48379,0],2] oob:tcp:send_nb to peer [[48379,0],0]:10
>>>> [linpc1:21511] [[48379,0],2] tcp:send_nb to peer [[48379,0],0]
>>>> [linpc1:21511] [[48379,0],2]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:508] post send to [[48379,0],0]
>>>> [linpc1:21511] [[48379,0],2]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:442] processing send to peer [[48379,0],0]:10
>>>> [linpc1:21511] [[48379,0],2]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:476] queue pending to [[48379,0],0]
>>>> [linpc1:21511] [[48379,0],2] tcp:send_nb: initiating connection to [[48379,0],0]
>>>> [linpc1:21511] [[48379,0],2]:[../../../../../openmpi-1.8.2a1r31742/orte/mca/oob/tcp/oob_tcp.c:490] connect to [[48379,0],0]
>>>> [linpc1:21511] [[48379,0],2] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] via interface eth0
>>>> [linpc1:21511] [[48379,0],2] oob:tcp:peer creating socket to [[48379,0],0]
>>>> [linpc1:21511] [[48379,0],2] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] via interface eth0 on socket 9
>>>> [linpc1:21511] [[48379,0],2] orte_tcp_peer_try_connect: attempting to connect to proc [[48379,0],0] on 193.174.24.39:55567 - 0 retries
>>>> [linpc1:21511] [[48379,0],2] waiting for connect completion to [[48379,0],0] - activating send event
>>>> [linpc1:21511] [[48379,0],2] tcp:send_handler called to send to peer [[48379,0],0]
>>>> [linpc1:21511] [[48379,0],2] tcp:send_handler CONNECTING
>>>> [linpc1:21511] [[48379,0],2]:tcp:complete_connect called for peer [[48379,0],0] on socket 9
>>>> [linpc1:21511] [[48379,0],2] tcp_peer_complete_connect: sending ack to [[48379,0],0]
>>>> [linpc1:21511] [[48379,0],2] SEND CONNECT ACK
>>>> [linpc1:21511] [[48379,0],2] send blocking of 48 bytes to socket 9
>>>> [linpc1:21511] [[48379,0],2] connect-ack sent to socket 9
>>>> [linpc1:21511] [[48379,0],2] tcp_peer_complete_connect: setting read event on connection to [[48379,0],0]
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] mca_oob_tcp_listen_thread: new connection: (16, 11) 193.174.26.208:53741
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] connection_handler: working connection (16, 11) 193.174.26.208:53741
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] CONNECTION REQUEST ON UNKNOWN INTERFACE
>>>> ^CKilled by signal 2.
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] OOB_SEND: ../../../../../openmpi-1.8.2a1r31742/orte/mca/rml/oob/rml_oob_send.c:199
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] OOB_SEND: ../../../../../openmpi-1.8.2a1r31742/orte/mca/rml/oob/rml_oob_send.c:199
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:base:send to target [[48379,0],1]
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:base:send unknown peer [[48379,0],1]
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] is NOT reachable by TCP
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:base:send to target [[48379,0],2]
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] oob:base:send unknown peer [[48379,0],2]
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] is NOT reachable by TCP
>>>> Killed by signal 2.
>>>> [tyr.informatik.hs-fulda.de:06877] [[48379,0],0] TCP SHUTDOWN
>>>> [tyr.informatik.hs-fulda.de:06877] mca: base: close: component tcp closed
>>>> [tyr.informatik.hs-fulda.de:06877] mca: base: close: unloading component tcp
>>>> tyr fd1026 106
>>>>
>>>> Thank you very much for your help in advance. Do you need anything else?
>>>>
>>>> Kind regards
>>>>
>>>> Siegmar
>>>>
>>>>
>>>>> On May 14, 2014, at 9:06 AM, Siegmar Gross
>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>
>>>>>> Hi Ralph,
>>>>>>
>>>>>>> What are the interfaces on these machines?
>>>>>>
>>>>>> tyr fd1026 111 ifconfig -a
>>>>>> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>>>>>>         inet 127.0.0.1 netmask ff000000
>>>>>> bge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
>>>>>>         inet 193.174.24.39 netmask ffffffe0 broadcast 193.174.24.63
>>>>>> tyr fd1026 112
>>>>>>
>>>>>> tyr fd1026 112 ssh sunpc1 ifconfig -a
>>>>>> lo0: flags=2001000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
>>>>>>         inet 127.0.0.1 netmask ff000000
>>>>>> nge0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
>>>>>>         inet 193.174.26.210 netmask ffffffc0 broadcast 193.174.26.255
>>>>>> tyr fd1026 113
>>>>>>
>>>>>> tyr fd1026 113 ssh linpc1 /sbin/ifconfig -a
>>>>>> eth0      Link encap:Ethernet  HWaddr 00:14:4F:23:FD:A8
>>>>>>           inet addr:193.174.26.208  Bcast:193.174.26.255  Mask:255.255.255.192
>>>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>>>           RX packets:18052524 errors:127 dropped:0 overruns:0 frame:127
>>>>>>           TX packets:15917888 errors:0 dropped:0 overruns:0 carrier:0
>>>>>>           collisions:0 txqueuelen:1000
>>>>>>           RX bytes:4158294157 (3965.6 Mb)  TX bytes:12060556809 (11501.8 Mb)
>>>>>>           Interrupt:23 Base address:0x4000
>>>>>>
>>>>>> eth1      Link encap:Ethernet  HWaddr 00:14:4F:23:FD:A9
>>>>>>           BROADCAST MULTICAST  MTU:1500  Metric:1
>>>>>>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>>>>>>           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>>>>>>           collisions:0 txqueuelen:1000
>>>>>>           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>>>>>>           Interrupt:45 Base address:0xa000
>>>>>>
>>>>>> lo        Link encap:Local Loopback
>>>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>>>           RX packets:1083 errors:0 dropped:0 overruns:0 frame:0
>>>>>>           TX packets:1083 errors:0 dropped:0 overruns:0 carrier:0
>>>>>>           collisions:0 txqueuelen:0
>>>>>>           RX bytes:329323 (321.6 Kb)  TX bytes:329323 (321.6 Kb)
>>>>>>
>>>>>> tyr fd1026 114
>>>>>>
>>>>>> Do you need something else?
>>>>>>
>>>>>> Kind regards
>>>>>>
>>>>>> Siegmar
>>>>>>
>>>>>>
>>>>>>> On May 14, 2014, at 7:45 AM, Siegmar Gross
>>>>>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I just installed openmpi-1.8.2a1r31742 on my machines (Solaris 10
>>>>>>>> Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with
>>>>>>>> Sun C 5.12 and still have the following problem.
>>>>>>>>
>>>>>>>> tyr fd1026 102 which mpiexec
>>>>>>>> /usr/local/openmpi-1.8.2_64_cc/bin/mpiexec
>>>>>>>> tyr fd1026 103 mpiexec -np 3 --host tyr,sunpc1,linpc1 hostname
>>>>>>>> [tyr.informatik.hs-fulda.de:12827] [[37949,0],0] CONNECTION REQUEST ON UNKNOWN INTERFACE
>>>>>>>> [tyr.informatik.hs-fulda.de:12827] [[37949,0],0] CONNECTION REQUEST ON UNKNOWN INTERFACE
>>>>>>>> ^CKilled by signal 2.
>>>>>>>> Killed by signal 2.
>>>>>>>> tyr fd1026 104
>>>>>>>>
>>>>>>>> The command works fine with openmpi-1.6.6rc1.
>>>>>>>>
>>>>>>>> tyr fd1026 102 which mpiexec
>>>>>>>> /usr/local/openmpi-1.6.6_64_cc/bin/mpiexec
>>>>>>>> tyr fd1026 103 mpiexec -np 3 --host tyr,sunpc1,linpc1 hostname
>>>>>>>> tyr.informatik.hs-fulda.de
>>>>>>>> linpc1
>>>>>>>> sunpc1
>>>>>>>> tyr fd1026 104
>>>>>>>>
>>>>>>>> I have reported the problem before and I would be grateful if
>>>>>>>> somebody could solve it. Please let me know if I can provide any
>>>>>>>> other information.
>>>>>>>>
>>>>>>>> Kind regards
>>>>>>>>
>>>>>>>> Siegmar

-- 
Jeff Squyres
jsquy...@cisco.com