By the way, it just *feels* like a race condition somewhere, because the
very next invocation worked (I ctrl-C'd it)...

[humphrey@zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0 -np 2 a.out
[zelda01.localdomain:19923] procdir: (null)
[zelda01.localdomain:19923] jobdir: (null)
[zelda01.localdomain:19923] unidir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
[zelda01.localdomain:19923] top:
openmpi-sessions-humphrey@zelda01.localdomain_0
[zelda01.localdomain:19923] tmp: /tmp
[zelda01.localdomain:19923] connect_uni: contact info read
[zelda01.localdomain:19923] connect_uni: connection not allowed
[zelda01.localdomain:19923] [0,0,0] setting up session dir with
[zelda01.localdomain:19923]     tmpdir /tmp
[zelda01.localdomain:19923]     universe default-universe-19923
[zelda01.localdomain:19923]     user humphrey
[zelda01.localdomain:19923]     host zelda01.localdomain
[zelda01.localdomain:19923]     jobid 0
[zelda01.localdomain:19923]     procid 0
[zelda01.localdomain:19923] procdir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-19923/0/0
[zelda01.localdomain:19923] jobdir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-19923/0
[zelda01.localdomain:19923] unidir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-19923
[zelda01.localdomain:19923] top:
openmpi-sessions-humphrey@zelda01.localdomain_0
[zelda01.localdomain:19923] tmp: /tmp
[zelda01.localdomain:19923] [0,0,0] contact_file
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-19923/universe-setup.txt
[zelda01.localdomain:19923] [0,0,0] wrote setup file
[zelda01.localdomain:19923] spawn: in job_state_callback(jobid = 1, state =
0x1)
[zelda01.localdomain:19923] pls:rsh: local csh: 0, local bash: 1
[zelda01.localdomain:19923] pls:rsh: assuming same remote shell as local
shell
[zelda01.localdomain:19923] pls:rsh: remote csh: 0, remote bash: 1
[zelda01.localdomain:19923] pls:rsh: final template argv:
[zelda01.localdomain:19923] pls:rsh:     ssh <template> orted --debug
--bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
<template> --universe humphrey@zelda01.localdomain:default-universe-19923
--nsreplica "0.0.0;tcp://130.207.252.131:35889" --gprreplica
"0.0.0;tcp://130.207.252.131:35889" --mpi-call-yield 0
[zelda01.localdomain:19923] pls:rsh: launching on node localhost
[zelda01.localdomain:19923] pls:rsh: oversubscribed -- setting
mpi_yield_when_idle to 1 (1 2)
[zelda01.localdomain:19923] pls:rsh: localhost is a LOCAL node
[zelda01.localdomain:19923] pls:rsh: executing: orted --debug --bootproxy 1
--name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe
humphrey@zelda01.localdomain:default-universe-19923 --nsreplica
"0.0.0;tcp://130.207.252.131:35889" --gprreplica
"0.0.0;tcp://130.207.252.131:35889" --mpi-call-yield 1
[zelda01.localdomain:19924] [0,0,1] setting up session dir with
[zelda01.localdomain:19924]     universe default-universe-19923
[zelda01.localdomain:19924]     user humphrey
[zelda01.localdomain:19924]     host localhost
[zelda01.localdomain:19924]     jobid 0
[zelda01.localdomain:19924]     procid 1
[zelda01.localdomain:19924] procdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/0/1
[zelda01.localdomain:19924] jobdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/0
[zelda01.localdomain:19924] unidir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923
[zelda01.localdomain:19924] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:19924] tmp: /tmp
[zelda01.localdomain:19925] [0,1,1] setting up session dir with
[zelda01.localdomain:19925]     universe default-universe-19923
[zelda01.localdomain:19925]     user humphrey
[zelda01.localdomain:19925]     host localhost
[zelda01.localdomain:19925]     jobid 1
[zelda01.localdomain:19925]     procid 1
[zelda01.localdomain:19925] procdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/1/1
[zelda01.localdomain:19925] jobdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/1
[zelda01.localdomain:19925] unidir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923
[zelda01.localdomain:19925] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:19925] tmp: /tmp
[zelda01.localdomain:19926] [0,1,0] setting up session dir with
[zelda01.localdomain:19926]     universe default-universe-19923
[zelda01.localdomain:19926]     user humphrey
[zelda01.localdomain:19926]     host localhost
[zelda01.localdomain:19926]     jobid 1
[zelda01.localdomain:19926]     procid 0
[zelda01.localdomain:19926] procdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/1/0
[zelda01.localdomain:19926] jobdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923/1
[zelda01.localdomain:19926] unidir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-19923
[zelda01.localdomain:19926] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:19926] tmp: /tmp
[zelda01.localdomain:19923] spawn: in job_state_callback(jobid = 1, state =
0x3)
[zelda01.localdomain:19923] Info: Setting up debugger process table for
applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 19925)
    (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 19926)
[zelda01.localdomain:19923] spawn: in job_state_callback(jobid = 1, state =
0x4)
[zelda01.localdomain:19926] [0,1,0] ompi_mpi_init completed
[zelda01.localdomain:19925] [0,1,1] ompi_mpi_init completed

 2 PE'S AS A  2 BY  1 GRID

  HALO2A  NPES,N =  2    2  TIME =  0.000013 SECONDS
  HALO2A  NPES,N =  2    4  TIME =  0.000013 SECONDS
  HALO2A  NPES,N =  2    8  TIME =  0.000013 SECONDS
  HALO2A  NPES,N =  2   16  TIME =  0.000014 SECONDS
  HALO2A  NPES,N =  2   32  TIME =  0.000015 SECONDS
  HALO2A  NPES,N =  2   64  TIME =  0.000016 SECONDS
  HALO2A  NPES,N =  2  128  TIME =  0.000026 SECONDS
  HALO2A  NPES,N =  2  256  TIME =  0.000037 SECONDS
mpiexec: killing job...
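
For reference, the persistent equivalent of those two --mca flags would just be
a couple of lines in ~/.openmpi/mca-params.conf on zelda01 -- a minimal sketch,
in the same format as the per-host files quoted further down:

# ~/.openmpi/mca-params.conf on zelda01 (sketch)
btl_tcp_if_include=eth0
oob_tcp_include=eth0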

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Jeff Squyres
> Sent: Wednesday, November 09, 2005 10:41 PM
> To: Open MPI Users
> Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines
> 
> Sorry for the delay in replying -- it's a crazy week here preparing for
> SC next week.
> 
> I'm double checking the code, and I don't see any obvious problems with
> the btl tcp include stuff.
> 
> Can you also specify that you want OMPI's "out of band" communication
> to use a specific network?
> 
> > mpiexec -d --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0
> > -np 2 a.out
> 
> With the segv's, do you get meaningful core dumps?  Can you send
> backtraces?
> 
> 
> 
> On Nov 8, 2005, at 3:02 PM, Marty Humphrey wrote:
> 
> > It's taken me a while, but I've simplified the experiment...
> >
> > In a nutshell, I'm seeing strange behavior in my multi-NIC box when I
> > attempt to execute " mpiexec -d --mca btl_tcp_if_include eth0  -np 2
> > a.out".
> > I have three different observed behaviors:
> >
> > [1] mpi worker rank 0 displays the banner and then just hangs
> > (apparently
> > trying to exchange MPI messages, which don't get delivered)
> >
> > 2 PE'S AS A  2 BY  1 GRID
> >
> > [2] it starts progressing (spitting out domain-specific msgs):
> >
> > 2 PE'S AS A  2 BY  1 GRID
> >
> >   HALO2A  NPES,N =  2    2  TIME =  0.000007 SECONDS
> >   HALO2A  NPES,N =  2    4  TIME =  0.000007 SECONDS
> >   HALO2A  NPES,N =  2    8  TIME =  0.000007 SECONDS
> >   HALO2A  NPES,N =  2   16  TIME =  0.000008 SECONDS
> >   HALO2A  NPES,N =  2   32  TIME =  0.000009 SECONDS
> >
> > [3] I get failure pretty quickly, with the line " mpiexec noticed that
> > job
> > rank 1 with PID 20425 on node "localhost" exited on signal 11."
> >
> > Here's the output of "ifconfig":
> >
> > [humphrey@zelda01 humphrey]$ /sbin/ifconfig
> > eth0      Link encap:Ethernet  HWaddr 00:11:43:DC:EA:EE
> >           inet addr:130.207.252.131  Bcast:130.207.252.255
> > Mask:255.255.255.0
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:2441905 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:112786 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:197322445 (188.1 Mb)  TX bytes:32906750 (31.3 Mb)
> >           Base address:0xecc0 Memory:dfae0000-dfb00000
> >
> > eth2      Link encap:Ethernet  HWaddr 00:11:95:C7:28:82
> >           inet addr:10.0.0.11  Bcast:10.0.0.255  Mask:255.255.255.0
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:11598757 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:7224590 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:1000
> >           RX bytes:3491651158 (3329.8 Mb)  TX bytes:1916674000 (1827.8
> > Mb)
> >           Interrupt:77 Base address:0xcc00
> >
> > ipsec0    Link encap:Ethernet  HWaddr 00:11:43:DC:EA:EE
> >           inet addr:130.207.252.131  Mask:255.255.255.0
> >           UP RUNNING NOARP  MTU:16260  Metric:1
> >           RX packets:40113 errors:0 dropped:40113 overruns:0 frame:0
> >           TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:10
> >           RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
> >
> > lo        Link encap:Local Loopback
> >           inet addr:127.0.0.1  Mask:255.0.0.0
> >           UP LOOPBACK RUNNING  MTU:16436  Metric:1
> >           RX packets:4742 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:4742 errors:0 dropped:0 overruns:0 carrier:0
> >           collisions:0 txqueuelen:0
> >           RX bytes:2369841 (2.2 Mb)  TX bytes:2369841 (2.2 Mb)
> >
> > This is with openmpi-1.1a1r8038 .
> >
> > Here is the output of a hanging invocation....
> >
> > ----- begin hanging invocation ----
> > [humphrey@zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0
> > -np 2
> > a.out
> > [zelda01.localdomain:20455] procdir: (null)
> > [zelda01.localdomain:20455] jobdir: (null)
> > [zelda01.localdomain:20455] unidir:
> > /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
> > [zelda01.localdomain:20455] top:
> > openmpi-sessions-humphrey@zelda01.localdomain_0
> > [zelda01.localdomain:20455] tmp: /tmp
> > [zelda01.localdomain:20455] connect_uni: contact info read
> > [zelda01.localdomain:20455] connect_uni: connection not allowed
> > [zelda01.localdomain:20455] [0,0,0] setting up session dir with
> > [zelda01.localdomain:20455]     tmpdir /tmp
> > [zelda01.localdomain:20455]     universe default-universe-20455
> > [zelda01.localdomain:20455]     user humphrey
> > [zelda01.localdomain:20455]     host zelda01.localdomain
> > [zelda01.localdomain:20455]     jobid 0
> > [zelda01.localdomain:20455]     procid 0
> > [zelda01.localdomain:20455] procdir:
> > /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20455/0/0
> > [zelda01.localdomain:20455] jobdir:
> > /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20455/0
> > [zelda01.localdomain:20455] unidir:
> > /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20455
> > [zelda01.localdomain:20455] top:
> > openmpi-sessions-humphrey@zelda01.localdomain_0
> > [zelda01.localdomain:20455] tmp: /tmp
> > [zelda01.localdomain:20455] [0,0,0] contact_file
> > /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20455/universe-setup.txt
> > [zelda01.localdomain:20455] [0,0,0] wrote setup file
> > [zelda01.localdomain:20455] pls:rsh: local csh: 0, local bash: 1
> > [zelda01.localdomain:20455] pls:rsh: assuming same remote shell as
> > local
> > shell
> > [zelda01.localdomain:20455] pls:rsh: remote csh: 0, remote bash: 1
> > [zelda01.localdomain:20455] pls:rsh: final template argv:
> > [zelda01.localdomain:20455] pls:rsh:     ssh <template> orted --debug
> > --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
> > <template> --universe humphrey@zelda01.localdomain:default-universe-20455
> > --nsreplica "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://130.207.252.131:35465"
> > --gprreplica "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://130.207.252.131:35465"
> > --mpi-call-yield 0
> > [zelda01.localdomain:20455] pls:rsh: launching on node localhost
> > [zelda01.localdomain:20455] pls:rsh: oversubscribed -- setting
> > mpi_yield_when_idle to 1 (1 2)
> > [zelda01.localdomain:20455] pls:rsh: localhost is a LOCAL node
> > [zelda01.localdomain:20455] pls:rsh: executing: orted --debug --bootproxy 1
> > --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe
> > humphrey@zelda01.localdomain:default-universe-20455 --nsreplica
> > "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://130.207.252.131:35465"
> > --gprreplica "0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://130.207.252.131:35465"
> > --mpi-call-yield 1
> > [zelda01.localdomain:20456] [0,0,1] setting up session dir with
> > [zelda01.localdomain:20456]     universe default-universe-20455
> > [zelda01.localdomain:20456]     user humphrey
> > [zelda01.localdomain:20456]     host localhost
> > [zelda01.localdomain:20456]     jobid 0
> > [zelda01.localdomain:20456]     procid 1
> > [zelda01.localdomain:20456] procdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/0/1
> > [zelda01.localdomain:20456] jobdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/0
> > [zelda01.localdomain:20456] unidir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455
> > [zelda01.localdomain:20456] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20456] tmp: /tmp
> > [zelda01.localdomain:20457] [0,1,1] setting up session dir with
> > [zelda01.localdomain:20457]     universe default-universe-20455
> > [zelda01.localdomain:20457]     user humphrey
> > [zelda01.localdomain:20457]     host localhost
> > [zelda01.localdomain:20457]     jobid 1
> > [zelda01.localdomain:20457]     procid 1
> > [zelda01.localdomain:20457] procdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1/1
> > [zelda01.localdomain:20457] jobdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1
> > [zelda01.localdomain:20457] unidir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455
> > [zelda01.localdomain:20457] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20457] tmp: /tmp
> > [zelda01.localdomain:20458] [0,1,0] setting up session dir with
> > [zelda01.localdomain:20458]     universe default-universe-20455
> > [zelda01.localdomain:20458]     user humphrey
> > [zelda01.localdomain:20458]     host localhost
> > [zelda01.localdomain:20458]     jobid 1
> > [zelda01.localdomain:20458]     procid 0
> > [zelda01.localdomain:20458] procdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1/0
> > [zelda01.localdomain:20458] jobdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1
> > [zelda01.localdomain:20458] unidir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455
> > [zelda01.localdomain:20458] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20458] tmp: /tmp
> > [zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0x3)
> > [zelda01.localdomain:20455] Info: Setting up debugger process table for
> > applications
> >   MPIR_being_debugged = 0
> >   MPIR_debug_gate = 0
> >   MPIR_debug_state = 1
> >   MPIR_acquired_pre_main = 0
> >   MPIR_i_am_starter = 0
> >   MPIR_proctable_size = 2
> >   MPIR_proctable:
> >     (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20457)
> >     (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20458)
> > [zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0x4)
> > [zelda01.localdomain:20458] [0,1,0] ompi_mpi_init completed
> > [zelda01.localdomain:20457] [0,1,1] ompi_mpi_init completed
> >
> >  2 PE'S AS A  2 BY  1 GRID
> > ------ end hanging invocation  -----
> >
> > Here's the 1-in-approximately-20 that started working...
> >
> > ------- begin non-hanging invocation -----
> > [humphrey@zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0
> > -np 2
> > a.out
> > [zelda01.localdomain:20659] procdir: (null)
> > [zelda01.localdomain:20659] jobdir: (null)
> > [zelda01.localdomain:20659] unidir:
> > /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
> > [zelda01.localdomain:20659] top:
> > openmpi-sessions-humphrey@zelda01.localdomain_0
> > [zelda01.localdomain:20659] tmp: /tmp
> > [zelda01.localdomain:20659] connect_uni: contact info read
> > [zelda01.localdomain:20659] connect_uni: connection not allowed
> > [zelda01.localdomain:20659] [0,0,0] setting up session dir with
> > [zelda01.localdomain:20659]     tmpdir /tmp
> > [zelda01.localdomain:20659]     universe default-universe-20659
> > [zelda01.localdomain:20659]     user humphrey
> > [zelda01.localdomain:20659]     host zelda01.localdomain
> > [zelda01.localdomain:20659]     jobid 0
> > [zelda01.localdomain:20659]     procid 0
> > [zelda01.localdomain:20659] procdir:
> > /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20659/0/0
> > [zelda01.localdomain:20659] jobdir:
> > /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20659/0
> > [zelda01.localdomain:20659] unidir:
> > /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20659
> > [zelda01.localdomain:20659] top:
> > openmpi-sessions-humphrey@zelda01.localdomain_0
> > [zelda01.localdomain:20659] tmp: /tmp
> > [zelda01.localdomain:20659] [0,0,0] contact_file
> > /tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe-20659/universe-setup.txt
> > [zelda01.localdomain:20659] [0,0,0] wrote setup file
> > [zelda01.localdomain:20659] pls:rsh: local csh: 0, local bash: 1
> > [zelda01.localdomain:20659] pls:rsh: assuming same remote shell as
> > local
> > shell
> > [zelda01.localdomain:20659] pls:rsh: remote csh: 0, remote bash: 1
> > [zelda01.localdomain:20659] pls:rsh: final template argv:
> > [zelda01.localdomain:20659] pls:rsh:     ssh <template> orted --debug
> > --bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
> > <template> --universe humphrey@zelda01.localdomain:default-universe-20659
> > --nsreplica "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://130.207.252.131:35654"
> > --gprreplica "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://130.207.252.131:35654"
> > --mpi-call-yield 0
> > [zelda01.localdomain:20659] pls:rsh: launching on node localhost
> > [zelda01.localdomain:20659] pls:rsh: oversubscribed -- setting
> > mpi_yield_when_idle to 1 (1 2)
> > [zelda01.localdomain:20659] pls:rsh: localhost is a LOCAL node
> > [zelda01.localdomain:20659] pls:rsh: executing: orted --debug --bootproxy 1
> > --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe
> > humphrey@zelda01.localdomain:default-universe-20659 --nsreplica
> > "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://130.207.252.131:35654"
> > --gprreplica "0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://130.207.252.131:35654"
> > --mpi-call-yield 1
> > [zelda01.localdomain:20660] [0,0,1] setting up session dir with
> > [zelda01.localdomain:20660]     universe default-universe-20659
> > [zelda01.localdomain:20660]     user humphrey
> > [zelda01.localdomain:20660]     host localhost
> > [zelda01.localdomain:20660]     jobid 0
> > [zelda01.localdomain:20660]     procid 1
> > [zelda01.localdomain:20660] procdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/0/1
> > [zelda01.localdomain:20660] jobdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/0
> > [zelda01.localdomain:20660] unidir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659
> > [zelda01.localdomain:20660] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20660] tmp: /tmp
> > [zelda01.localdomain:20661] [0,1,1] setting up session dir with
> > [zelda01.localdomain:20661]     universe default-universe-20659
> > [zelda01.localdomain:20661]     user humphrey
> > [zelda01.localdomain:20661]     host localhost
> > [zelda01.localdomain:20661]     jobid 1
> > [zelda01.localdomain:20661]     procid 1
> > [zelda01.localdomain:20661] procdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1/1
> > [zelda01.localdomain:20661] jobdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1
> > [zelda01.localdomain:20661] unidir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659
> > [zelda01.localdomain:20661] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20661] tmp: /tmp
> > [zelda01.localdomain:20662] [0,1,0] setting up session dir with
> > [zelda01.localdomain:20662]     universe default-universe-20659
> > [zelda01.localdomain:20662]     user humphrey
> > [zelda01.localdomain:20662]     host localhost
> > [zelda01.localdomain:20662]     jobid 1
> > [zelda01.localdomain:20662]     procid 0
> > [zelda01.localdomain:20662] procdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1/0
> > [zelda01.localdomain:20662] jobdir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1
> > [zelda01.localdomain:20662] unidir:
> > /tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659
> > [zelda01.localdomain:20662] top: openmpi-sessions-humphrey@localhost_0
> > [zelda01.localdomain:20662] tmp: /tmp
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0x3)
> > [zelda01.localdomain:20659] Info: Setting up debugger process table for
> > applications
> >   MPIR_being_debugged = 0
> >   MPIR_debug_gate = 0
> >   MPIR_debug_state = 1
> >   MPIR_acquired_pre_main = 0
> >   MPIR_i_am_starter = 0
> >   MPIR_proctable_size = 2
> >   MPIR_proctable:
> >     (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20661)
> >     (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20662)
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0x4)
> > [zelda01.localdomain:20662] [0,1,0] ompi_mpi_init completed
> > [zelda01.localdomain:20661] [0,1,1] ompi_mpi_init completed
> >
> >  2 PE'S AS A  2 BY  1 GRID
> >
> >   HALO2A  NPES,N =  2    2  TIME =  0.000007 SECONDS
> >   HALO2A  NPES,N =  2    4  TIME =  0.000007 SECONDS
> >   HALO2A  NPES,N =  2    8  TIME =  0.000007 SECONDS
> >   HALO2A  NPES,N =  2   16  TIME =  0.000008 SECONDS
> >   HALO2A  NPES,N =  2   32  TIME =  0.000009 SECONDS
> >   HALO2A  NPES,N =  2   64  TIME =  0.000011 SECONDS
> > mpiexec: killing job...
> > Interrupt
> > Interrupt
> > [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: job session dir not
> > empty -
> > leaving
> > [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found job session dir
> > empty -
> > deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: univ session dir not
> > empty -
> > leaving
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0xa)
> > [zelda01.localdomain:20660] orted: job_state_callback(jobid = 1, state
> > =
> > ORTE_PROC_STATE_ABORTED)
> > [zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
> > state =
> > 0x9)
> > 2 processes killed (possibly by Open MPI)
> > [zelda01.localdomain:20660] orted: job_state_callback(jobid = 1, state
> > =
> > ORTE_PROC_STATE_TERMINATED)
> > [zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found job session dir
> > empty -
> > deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found univ session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20660] sess_dir_finalize: found top session dir
> > empty -
> > deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: found proc session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: found job session dir
> > empty -
> > deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: found univ session dir
> > empty
> > - deleting
> > [zelda01.localdomain:20659] sess_dir_finalize: top session dir not
> > empty -
> > leaving
> > [humphrey@zelda01 humphrey]$
> > -------- end non-hanging invocation ------
> >
> > Any thoughts?
> >
> > -- Marty
> >
> >> -----Original Message-----
> >> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> >> On
> >> Behalf Of Jeff Squyres
> >> Sent: Tuesday, November 01, 2005 2:17 PM
> >> To: Open MPI Users
> >> Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines
> >>
> >> On Nov 1, 2005, at 12:02 PM, Marty Humphrey wrote:
> >>
> >>> wukong: eth0 (152.48.249.102, no MPI traffic), eth1 (128.109.34.20, yes MPI traffic)
> >>> zelda01: eth0 (130.207.252.131, yes MPI traffic), eth2 (10.0.0.12, no MPI traffic)
> >>>
> >>> on wukong, I have :
> >>> [humphrey@wukong ~]$ more ~/.openmpi/mca-params.conf
> >>> btl_tcp_if_include=eth1
> >>> on zelda01, I have :
> >>> [humphrey@zelda01 humphrey]$ more ~/.openmpi/mca-params.conf
> >>> btl_tcp_if_include=eth0
> >>
> >> Just to make sure I'm reading this right -- 128.109.34.20 is supposed
> >> to be routable to 130.207.252.131, right?  Can you ssh directly from
> >> one machine to the other?  (I'm guessing that you can because OMPI was
> >> able to start processes)  Can you ping one machine from the other?
> >>
> >> Most importantly -- can you open arbitrary TCP ports between the two
> >> machines?  (i.e., not just well-known ports like 22 [ssh], etc.)
> >>
> >> --
> >> {+} Jeff Squyres
> >> {+} The Open MPI Project
> >> {+} http://www.open-mpi.org/
> >>
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> 
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
