By the way, it just *feels* like a race condition somewhere: the very next invocation worked fine until I Ctrl-C'd it...
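Two quick notes before the paste. First, rather than passing the --mca flags on every command line, I'm planning to put the same two settings into ~/.openmpi/mca-params.conf on zelda01 -- as far as I understand the config-file syntax this is equivalent to the flags below, but please correct me if the oob parameter is spelled differently there:

btl_tcp_if_include=eth0
oob_tcp_include=eth0

Second, if I'm reading the -d output right, adding oob_tcp_include does change something: the --nsreplica/--gprreplica URIs below list only tcp://130.207.252.131, whereas in the earlier runs (without it) they also advertised tcp://10.0.0.11. I've also put replies to your backtrace and arbitrary-port questions in a P.S. at the very bottom, below the quoted thread.

Here's the full -d output from that run: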
[humphrey@zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0 -np 2 a.out
[zelda01.localdomain:19923] procdir: (null)
[zelda01.localdomain:19923] jobdir: (null)
[zelda01.localdomain:19923] unidir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
[zelda01.localdomain:19923] top: openmpi-sessions-humphrey@zelda01.localdomain_0
[zelda01.localdomain:19923] tmp: /tmp
[zelda01.localdomain:19923] connect_uni: contact info read
[zelda01.localdomain:19923] connect_uni: connection not allowed
[zelda01.localdomain:19923] [0,0,0] setting up session dir with
[zelda01.localdomain:19923] tmpdir /tmp
[zelda01.localdomain:19923] universe default-universe-19923
[zelda01.localdomain:19923] user humphrey
[zelda01.localdomain:19923] host zelda01.localdomain
[zelda01.localdomain:19923] jobid 0
[zelda01.localdomain:19923] procid 0
[zelda01.localdomain:19923] procdir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-19923/0/0
[zelda01.localdomain:19923] jobdir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-19923/0
[zelda01.localdomain:19923] unidir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-19923
[zelda01.localdomain:19923] top: openmpi-sessions-humphrey@zelda01.localdomain_0
[zelda01.localdomain:19923] tmp: /tmp
[zelda01.localdomain:19923] [0,0,0] contact_file /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-19923/universe-setup.txt
[zelda01.localdomain:19923] [0,0,0] wrote setup file
[zelda01.localdomain:19923] spawn: in job_state_callback(jobid = 1, state = 0x1)
[zelda01.localdomain:19923] pls:rsh: local csh: 0, local bash: 1
[zelda01.localdomain:19923] pls:rsh: assuming same remote shell as local shell
[zelda01.localdomain:19923] pls:rsh: remote csh: 0, remote bash: 1
[zelda01.localdomain:19923] pls:rsh: final template argv:
[zelda01.localdomain:19923] pls:rsh: ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe humphrey@zelda01.localdomain:default-universe-19923 --nsreplica "0.0.0;tcp://130.207.252.131:35889" --gprreplica "0.0.0;tcp://130.207.252.131:35889" --mpi-call-yield 0
[zelda01.localdomain:19923] pls:rsh: launching on node localhost
[zelda01.localdomain:19923] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (1 2)
[zelda01.localdomain:19923] pls:rsh: localhost is a LOCAL node
[zelda01.localdomain:19923] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe humphrey@zelda01.localdomain:default-universe-19923 --nsreplica "0.0.0;tcp://130.207.252.131:35889" --gprreplica "0.0.0;tcp://130.207.252.131:35889" --mpi-call-yield 1
[zelda01.localdomain:19924] [0,0,1] setting up session dir with
[zelda01.localdomain:19924] universe default-universe-19923
[zelda01.localdomain:19924] user humphrey
[zelda01.localdomain:19924] host localhost
[zelda01.localdomain:19924] jobid 0
[zelda01.localdomain:19924] procid 1
[zelda01.localdomain:19924] procdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/0/1
[zelda01.localdomain:19924] jobdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/0
[zelda01.localdomain:19924] unidir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923
[zelda01.localdomain:19924] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:19924] tmp: /tmp
[zelda01.localdomain:19925] [0,1,1] setting up session dir with
[zelda01.localdomain:19925] universe default-universe-19923
[zelda01.localdomain:19925] user humphrey
[zelda01.localdomain:19925] host localhost
[zelda01.localdomain:19925] jobid 1
[zelda01.localdomain:19925] procid 1
[zelda01.localdomain:19925] procdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/1/1
[zelda01.localdomain:19925] jobdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/1
[zelda01.localdomain:19925] unidir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923
[zelda01.localdomain:19925] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:19925] tmp: /tmp
[zelda01.localdomain:19926] [0,1,0] setting up session dir with
[zelda01.localdomain:19926] universe default-universe-19923
[zelda01.localdomain:19926] user humphrey
[zelda01.localdomain:19926] host localhost
[zelda01.localdomain:19926] jobid 1
[zelda01.localdomain:19926] procid 0
[zelda01.localdomain:19926] procdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/1/0
[zelda01.localdomain:19926] jobdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/1
[zelda01.localdomain:19926] unidir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923
[zelda01.localdomain:19926] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:19926] tmp: /tmp
[zelda01.localdomain:19923] spawn: in job_state_callback(jobid = 1, state = 0x3)
[zelda01.localdomain:19923] Info: Setting up debugger process table for applications
MPIR_being_debugged = 0
MPIR_debug_gate = 0
MPIR_debug_state = 1
MPIR_acquired_pre_main = 0
MPIR_i_am_starter = 0
MPIR_proctable_size = 2
MPIR_proctable:
(i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 19925)
(i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 19926)
[zelda01.localdomain:19923] spawn: in job_state_callback(jobid = 1, state = 0x4)
[zelda01.localdomain:19926] [0,1,0] ompi_mpi_init completed
[zelda01.localdomain:19925] [0,1,1] ompi_mpi_init completed

2 PE'S AS A 2 BY 1 GRID

HALO2A NPES,N = 2 2 TIME = 0.000013 SECONDS
HALO2A NPES,N = 2 4 TIME = 0.000013 SECONDS
HALO2A NPES,N = 2 8 TIME = 0.000013 SECONDS
HALO2A NPES,N = 2 16 TIME = 0.000014 SECONDS
HALO2A NPES,N = 2 32 TIME = 0.000015 SECONDS
HALO2A NPES,N = 2 64 TIME = 0.000016 SECONDS
HALO2A NPES,N = 2 128 TIME = 0.000026 SECONDS
HALO2A NPES,N = 2 256 TIME = 0.000037 SECONDS
mpiexec: killing job...

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Jeff Squyres
> Sent: Wednesday, November 09, 2005 10:41 PM
> To: Open MPI Users
> Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines
>
> Sorry for the delay in replying -- it's a crazy week here preparing for
> SC next week.
>
> I'm double checking the code, and I don't see any obvious problems with
> the btl tcp include stuff.
>
> Can you also specify that you want OMPI's "out of band" communication
> to use a specific network?
>
> > mpiexec -d --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0
> > -np 2 a.out
>
> With the segv's, do you get meaningful core dumps?  Can you send
> backtraces?
>
> On Nov 8, 2005, at 3:02 PM, Marty Humphrey wrote:
>
> > It's taken me a while, but I've simplified the experiment...
> >
> > In a nutshell, I'm seeing strange behavior in my multi-NIC box when I
> > attempt to execute "mpiexec -d --mca btl_tcp_if_include eth0 -np 2
> > a.out".
> >
> > I have three different observed behaviors:
> >
> > [1] mpi worker rank 0 displays the banner and then just hangs
> > (apparently trying to exchange MPI messages, which don't get delivered)
> >
> > 2 PE'S AS A 2 BY 1 GRID
> >
> > [2] it starts progressing (spitting out domain-specific msgs):
> >
> > 2 PE'S AS A 2 BY 1 GRID
> >
> > HALO2A NPES,N = 2 2 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 4 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 8 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 16 TIME = 0.000008 SECONDS
> > HALO2A NPES,N = 2 32 TIME = 0.000009 SECONDS
> >
> > [3] I get failure pretty quickly, with the line "mpiexec noticed that job
> > rank 1 with PID 20425 on node "localhost" exited on signal 11."
> >
> > Here's the output of "ifconfig":
> >
> > [humphrey@zelda01 humphrey]$ /sbin/ifconfig
> > eth0      Link encap:Ethernet  HWaddr 00:11:43:DC:EA:EE
> >           inet addr:130.207.252.131  Bcast:130.207.252.255  Mask:255.255.255.0
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:2441905 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:112786 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:197322445 (188.1 Mb)  TX bytes:32906750 (31.3 Mb)
> >           Base address:0xecc0 Memory:dfae0000-dfb00000
> >
> > eth2      Link encap:Ethernet  HWaddr 00:11:95:C7:28:82
> >           inet addr:10.0.0.11  Bcast:10.0.0.255  Mask:255.255.255.0
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:11598757 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:7224590 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:3491651158 (3329.8 Mb)  TX bytes:1916674000 (1827.8 Mb)
> >           Interrupt:77 Base address:0xcc00
> >
> > ipsec0    Link encap:Ethernet  HWaddr 00:11:43:DC:EA:EE
> >           inet addr:130.207.252.131  Mask:255.255.255.0
> >           UP RUNNING NOARP  MTU:16260  Metric:1
> >           RX packets:40113 errors:0 dropped:40113 overruns:0 frame:0
> >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:10
> >           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> >
> > lo        Link encap:Local Loopback
> >           inet addr:127.0.0.1  Mask:255.0.0.0
> >           UP LOOPBACK RUNNING  MTU:16436  Metric:1
> >           RX packets:4742 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:4742 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:0
> >           RX bytes:2369841 (2.2 Mb)  TX bytes:2369841 (2.2 Mb)
> >
> > This is with openmpi-1.1a1r8038.
> >
> > Here is the output of a hanging invocation....
> >
> > ----- begin hanging invocation ----
> > [humphrey@zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0 -np 2 a.out
> > [zelda01.localdomain:20455] procdir: (null)
> > [zelda01.localdomain:20455] jobdir: (null)
> > [zelda01.localdomain:20455] unidir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
> > [zelda01.localdomain:20455] top: openmpi-sessions-humphrey@zelda01.localdomain_0
> > [zelda01.localdomain:20455] tmp: /tmp
> > [zelda01.localdomain:20455] connect_uni: contact info read
> > [zelda01.localdomain:20455] connect_uni: connection not allowed
> > [zelda01.localdomain:20455] [0,0,0] setting up session dir with
> > [zelda01.localdomain:20455] tmpdir /tmp
> > [zelda01.localdomain:20455] universe default-universe-20455
> > [zelda01.localdomain:20455] user humphrey
> > [zelda01.localdomain:20455] host zelda01.localdomain
> > [zelda01.localdomain:20455] jobid 0
> > [zelda01.localdomain:20455] procid 0
> > [zelda01.localdomain:20455] procdir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20455/0/0
> > [zelda01.localdomain:20455] jobdir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20455/0
> > [zelda01.localdomain:20455] unidir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20455
> > [zelda01.localdomain:20455] top: openmpi-sessions-humphrey@zelda01.localdomain_0
> > [zelda01.localdomain:20455] tmp: /tmp
> > [zelda01.localdomain:20455] [0,0,0] contact_file /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20455/universe-setup.txt
> > [zelda01.localdomain:20455] [0,0,0] wrote setup file
> > [zelda01.localdomain:20455] pls:rsh: local csh: 0, local bash: 1
> > [zelda01.localdomain:20455] pls:rsh: assuming same remote shell as local shell
> > [zelda01.localdomain:20455] pls:rsh: remote csh: 0, remote bash: 1
> > [zelda01.localdomain:20455] pls:rsh: final template argv:
> > [zelda01.localdomain:20455] pls:rsh: ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe humphrey@zelda01.localdomain:default-universe-20455 --nsreplica "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://130.207.252.131:35465" --gprreplica "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://130.207.252.131:35465" --mpi-call-yield 0
> > [zelda01.localdomain:20455] pls:rsh: launching on node localhost
> > [zelda01.localdomain:20455] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (1 2)
> > [zelda01.localdomain:20455] pls:rsh: localhost is a LOCAL node
> > [zelda01.localdomain:20455] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe humphrey@zelda01.localdomain:default-universe-20455 --nsreplica "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://130.207.252.131:35465" --gprreplica "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://130.207.252.131:35465" --mpi-call-yield 1
> > [zelda01.localdomain:20456] [0,0,1] setting up session dir with
> > [zelda01.localdomain:20456] universe default-universe-20455
> > [zelda01.localdomain:20456] user humphrey
> > [zelda01.localdomain:20456] host localhost
> > [zelda01.localdomain:20456] jobid 0
> > [zelda01.localdomain:20456] procid 1
> > [zelda01.localdomain:20456] procdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/0/1
> > [zelda01.localdomain:20456] jobdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/0
> > [zelda01.localdomain:20456] unidir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455
> > [zelda01.localdomain:20456] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20456] tmp: /tmp
> > [zelda01.localdomain:20457] [0,1,1] setting up session dir with
> > [zelda01.localdomain:20457] universe default-universe-20455
> > [zelda01.localdomain:20457] user humphrey
> > [zelda01.localdomain:20457] host localhost
> > [zelda01.localdomain:20457] jobid 1
> > [zelda01.localdomain:20457] procid 1
> > [zelda01.localdomain:20457] procdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1/1
> > [zelda01.localdomain:20457] jobdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1
> > [zelda01.localdomain:20457] unidir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455
> > [zelda01.localdomain:20457] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20457] tmp: /tmp
> > [zelda01.localdomain:20458] [0,1,0] setting up session dir with
> > [zelda01.localdomain:20458] universe default-universe-20455
> > [zelda01.localdomain:20458] user humphrey
> > [zelda01.localdomain:20458] host localhost
> > [zelda01.localdomain:20458] jobid 1
> > [zelda01.localdomain:20458] procid 0
> > [zelda01.localdomain:20458] procdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1/0
> > [zelda01.localdomain:20458] jobdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1
> > [zelda01.localdomain:20458] unidir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455
> > [zelda01.localdomain:20458] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20458] tmp: /tmp
> > [zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1, state = 0x3)
> > [zelda01.localdomain:20455] Info: Setting up debugger process table for applications
> > MPIR_being_debugged = 0
> > MPIR_debug_gate = 0
> > MPIR_debug_state = 1
> > MPIR_acquired_pre_main = 0
> > MPIR_i_am_starter = 0
> > MPIR_proctable_size = 2
> > MPIR_proctable:
> > (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20457)
> > (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20458)
> > [zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1, state = 0x4)
> > [zelda01.localdomain:20458] [0,1,0] ompi_mpi_init completed
> > [zelda01.localdomain:20457] [0,1,1] ompi_mpi_init completed
> >
> > 2 PE'S AS A 2 BY 1 GRID
> > ------ end hanging invocation -----
> >
> > Here's the 1-in-approximately-20 that started working...
> >
> > ------- begin non-hanging invocation -----
> > [humphrey@zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0 -np 2 a.out
> > [zelda01.localdomain:20659] procdir: (null)
> > [zelda01.localdomain:20659] jobdir: (null)
> > [zelda01.localdomain:20659] unidir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
> > [zelda01.localdomain:20659] top: openmpi-sessions-humphrey@zelda01.localdomain_0
> > [zelda01.localdomain:20659] tmp: /tmp
> > [zelda01.localdomain:20659] connect_uni: contact info read
> > [zelda01.localdomain:20659] connect_uni: connection not allowed
> > [zelda01.localdomain:20659] [0,0,0] setting up session dir with
> > [zelda01.localdomain:20659] tmpdir /tmp
> > [zelda01.localdomain:20659] universe default-universe-20659
> > [zelda01.localdomain:20659] user humphrey
> > [zelda01.localdomain:20659] host zelda01.localdomain
> > [zelda01.localdomain:20659] jobid 0
> > [zelda01.localdomain:20659] procid 0
> > [zelda01.localdomain:20659] procdir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20659/0/0
> > [zelda01.localdomain:20659] jobdir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20659/0
> > [zelda01.localdomain:20659] unidir: /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20659
> > [zelda01.localdomain:20659] top: openmpi-sessions-humphrey@zelda01.localdomain_0
> > [zelda01.localdomain:20659] tmp: /tmp
> > [zelda01.localdomain:20659] [0,0,0] contact_file /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20659/universe-setup.txt
> > [zelda01.localdomain:20659] [0,0,0] wrote setup file
> > [zelda01.localdomain:20659] pls:rsh: local csh: 0, local bash: 1
> > [zelda01.localdomain:20659] pls:rsh: assuming same remote shell as local shell
> > [zelda01.localdomain:20659] pls:rsh: remote csh: 0, remote bash: 1
> > [zelda01.localdomain:20659] pls:rsh: final template argv:
> > [zelda01.localdomain:20659] pls:rsh: ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename <template> --universe humphrey@zelda01.localdomain:default-universe-20659 --nsreplica "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://130.207.252.131:35654" --gprreplica "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://130.207.252.131:35654" --mpi-call-yield 0
> > [zelda01.localdomain:20659] pls:rsh: launching on node localhost
> > [zelda01.localdomain:20659] pls:rsh: oversubscribed -- setting mpi_yield_when_idle to 1 (1 2)
> > [zelda01.localdomain:20659] pls:rsh: localhost is a LOCAL node
> > [zelda01.localdomain:20659] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe humphrey@zelda01.localdomain:default-universe-20659 --nsreplica "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://130.207.252.131:35654" --gprreplica "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://130.207.252.131:35654" --mpi-call-yield 1
> > [zelda01.localdomain:20660] [0,0,1] setting up session dir with
> > [zelda01.localdomain:20660] universe default-universe-20659
> > [zelda01.localdomain:20660] user humphrey
> > [zelda01.localdomain:20660] host localhost
> > [zelda01.localdomain:20660] jobid 0
> > [zelda01.localdomain:20660] procid 1
> > [zelda01.localdomain:20660] procdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/0/1
> > [zelda01.localdomain:20660] jobdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/0
> > [zelda01.localdomain:20660] unidir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659
> > [zelda01.localdomain:20660] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20660] tmp: /tmp
> > [zelda01.localdomain:20661] [0,1,1] setting up session dir with
> > [zelda01.localdomain:20661] universe default-universe-20659
> > [zelda01.localdomain:20661] user humphrey
> > [zelda01.localdomain:20661] host localhost
> > [zelda01.localdomain:20661] jobid 1
> > [zelda01.localdomain:20661] procid 1
> > [zelda01.localdomain:20661] procdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1/1
> > [zelda01.localdomain:20661] jobdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1
> > [zelda01.localdomain:20661] unidir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659
> > [zelda01.localdomain:20661] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20661] tmp: /tmp
> > [zelda01.localdomain:20662] [0,1,0] setting up session dir with
> > [zelda01.localdomain:20662] universe default-universe-20659
> > [zelda01.localdomain:20662] user humphrey
> > [zelda01.localdomain:20662] host localhost
> > [zelda01.localdomain:20662] jobid 1
> > [zelda01.localdomain:20662] procid 0
> > [zelda01.localdomain:20662] procdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1/0
> > [zelda01.localdomain:20662] jobdir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1
> > [zelda01.localdomain:20662] unidir: /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659
> > [zelda01.localdomain:20662] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20662] tmp: /tmp
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1, state = 0x3)
> > [zelda01.localdomain:20659] Info: Setting up debugger process table for applications
> > MPIR_being_debugged = 0
> > MPIR_debug_gate = 0
> > MPIR_debug_state = 1
> > MPIR_acquired_pre_main = 0
> > MPIR_i_am_starter = 0
> > MPIR_proctable_size = 2
> > MPIR_proctable:
> > (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20661)
> > (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20662)
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1, state = 0x4)
> > [zelda01.localdomain:20662] [0,1,0] ompi_mpi_init completed
> > [zelda01.localdomain:20661] [0,1,1] ompi_mpi_init completed
> >
> > 2 PE'S AS A 2 BY 1 GRID
> >
> > HALO2A NPES,N = 2 2 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 4 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 8 TIME = 0.000007 SECONDS
> > HALO2A NPES,N = 2 16 TIME = 0.000008 SECONDS
> > HALO2A NPES,N = 2 32 TIME = 0.000009 SECONDS
> > HALO2A NPES,N = 2 64 TIME = 0.000011 SECONDS
> > mpiexec: killing job...
> > Interrupt
> > Interrupt
> > [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir empty - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: job session dir not empty - leaving
> > [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir empty - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found job session dir empty - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: univ session dir not empty - leaving
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1, state = 0xa)
> > [zelda01.localdomain:20660] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1, state = 0x9)
> > 2 processes killed (possibly by Open MPI)
> > [zelda01.localdomain:20660] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
> > [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir empty - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found job session dir empty - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found univ session dir empty - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found top session dir empty - deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: found proc session dir empty - deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: found job session dir empty - deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: found univ session dir empty - deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: top session dir not empty - leaving
> > [humphrey@zelda01 humphrey]$
> > -------- end non-hanging invocation ------
> >
> > Any thoughts?
> >
> > -- Marty
> >
> >> -----Original Message-----
> >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> >> Behalf Of Jeff Squyres
> >> Sent: Tuesday, November 01, 2005 2:17 PM
> >> To: Open MPI Users
> >> Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines
> >>
> >> On Nov 1, 2005, at 12:02 PM, Marty Humphrey wrote:
> >>
> >>> wukong: eth0 (152.48.249.102, no MPI traffic), eth1 (128.109.34.20, yes MPI traffic)
> >>> zelda01: eth0 (130.207.252.131, yes MPI traffic), eth2 (10.0.0.12, no MPI traffic)
> >>>
> >>> on wukong, I have:
> >>> [humphrey@wukong ~]$ more ~/.openmpi/mca-params.conf
> >>> btl_tcp_if_include=eth1
> >>> on zelda01, I have:
> >>> [humphrey@zelda01 humphrey]$ more ~/.openmpi/mca-params.conf
> >>> btl_tcp_if_include=eth0
> >>
> >> Just to make sure I'm reading this right -- 128.109.34.20 is supposed
> >> to be routable to 130.207.252.131, right?  Can you ssh directly from
> >> one machine to the other?  (I'm guessing that you can because OMPI was
> >> able to start processes)  Can you ping one machine from the other?
> >>
> >> Most importantly -- can you open arbitrary TCP ports between the two
> >> machines?  (i.e., not just well-known ports like 22 [ssh], etc.)
> >> > >> -- > >> {+} Jeff Squyres > >> {+} The Open MPI Project > >> {+} http://www.open-mpi.org/ > >> > >> _______________________________________________ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > -- > {+} Jeff Squyres > {+} The Open MPI Project > {+} http://www.open-mpi.org/ > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users