To close this thread for the web archives...

We iterated on this quite a bit off-list and fixed a pair of bugs; the fixes didn't make it into RC5. Many thanks to Marty for his patience in helping us fix them!

For those who care, the bugs were:

- The shared memory btl had a problem if mmap() returned different addresses for the same shared memory segment in different processes.
- The TCP btl does subnet mask checking to help determine which IP addresses to hook up amongst peers (remember that Open MPI can utilize multiple TCP interfaces in a single job); a bug prevented arbitrary, non-subnet-mask-matched connections.
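
For illustration only (this is not Open MPI source; the type and function names below are made up), the two ideas behind the fixes look roughly like this in C:

/*
 * Sketch only.  (1) A shared-memory segment may be mmap()ed at a different
 * virtual address in each process, so data structures inside it must store
 * offsets from the segment base rather than raw pointers.  (2) Whether two
 * IPv4 addresses "match" can be decided by comparing them under a netmask.
 */
#include <stddef.h>
#include <netinet/in.h>

/* (1) offset-based addressing inside a shared segment */
typedef struct {
    size_t head_offset;          /* offset of the first item, NOT a pointer */
} shm_queue_t;

static inline void *shm_ptr(void *my_mmap_base, size_t offset)
{
    /* each process adds the offset to ITS OWN mmap() return value */
    return (char *) my_mmap_base + offset;
}

/* (2) subnet-mask comparison of two IPv4 addresses (network byte order) */
static inline int same_subnet(struct in_addr a, struct in_addr b,
                              struct in_addr netmask)
{
    return (a.s_addr & netmask.s_addr) == (b.s_addr & netmask.s_addr);
}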

Both fixes are in the SVN trunk and the v1.0 branch, and are included in the nightly snapshot tarballs from this morning.



On Nov 10, 2005, at 9:02 AM, Marty Humphrey wrote:

Here's a core I'm getting...

[humphrey@zelda01 humphrey]$ mpiexec --mca btl_tcp_if_include eth0 --mca
oob_tcp_include eth0  -np 2 a.out
mpiexec noticed that job rank 1 with PID 20028 on node "localhost" exited on
signal 11.
1 process killed (possibly by Open MPI)

[humphrey@zelda01 humphrey]$ gdb a.out core.20028
GNU gdb Red Hat Linux (6.3.0.0-1.62rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db
library "/lib/tls/libthread_db.so.1".

Core was generated by `a.out'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /home/humphrey/ompi-install/lib/libmpi.so.0...done.
Loaded symbols for /home/humphrey/ompi-install/lib/libmpi.so.0
Reading symbols from /home/humphrey/ompi-install/lib/liborte.so.0...done.
Loaded symbols for /home/humphrey/ompi-install/lib/liborte.so.0
Reading symbols from /home/humphrey/ompi-install/lib/libopal.so.0...done.
Loaded symbols for /home/humphrey/ompi-install/lib/libopal.so.0
Reading symbols from /lib/libutil.so.1...done.
Loaded symbols for /lib/libutil.so.1
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /usr/lib/libaio.so.1...done.
Loaded symbols for /usr/lib/libaio.so.1
Reading symbols from /usr/lib/libg2c.so.0...done.
Loaded symbols for /usr/lib/libg2c.so.0
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Reading symbols from /lib/tls/libpthread.so.0...done.
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_paffinity_linux.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_paffinity_linux.so
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_ns_proxy.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_ns_proxy.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_ns_replica.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_ns_replica.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_rml_oob.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_rml_oob.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_oob_tcp.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_oob_tcp.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_gpr_null.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_gpr_null.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_gpr_proxy.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_gpr_proxy.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_gpr_replica.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_gpr_replica.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_rmgr_proxy.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_rmgr_proxy.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_rmgr_urm.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_rmgr_urm.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_rds_hostfile.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_rds_hostfile.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_rds_resfile.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_rds_resfile.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_ras_dash_host.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_ras_dash_host.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_ras_hostfile.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_ras_hostfile.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_ras_localhost.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_ras_localhost.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_ras_slurm.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_ras_slurm.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_rmaps_round_robin.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_rmaps_round_robin.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_pls_fork.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_pls_fork.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_pls_proxy.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_pls_proxy.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_pls_rsh.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_pls_rsh.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_pls_slurm.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_pls_slurm.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_iof_proxy.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_iof_proxy.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_allocator_basic.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_allocator_basic.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_allocator_bucket.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_allocator_bucket.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_rcache_rb.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_rcache_rb.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_mpool_sm.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_mpool_sm.so
Reading symbols from
/home/humphrey/ompi-install/lib/libmca_common_sm.so.0...done.
Loaded symbols for /home/humphrey/ompi-install/lib/libmca_common_sm.so.0
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_pml_ob1.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_pml_ob1.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_bml_r2.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_bml_r2.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_btl_self.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_btl_self.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_btl_sm.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_btl_sm.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_btl_tcp.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_btl_tcp.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_ptl_self.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_ptl_self.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_ptl_sm.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_ptl_sm.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_ptl_tcp.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_ptl_tcp.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_coll_basic.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_coll_basic.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_coll_hierarch.so...done.
Loaded symbols for
/home/humphrey/ompi-install/lib/openmpi/mca_coll_hierarch.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_coll_self.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_coll_self.so
Reading symbols from
/home/humphrey/ompi-install/lib/openmpi/mca_coll_sm.so...done.
Loaded symbols for /home/humphrey/ompi-install/lib/openmpi/mca_coll_sm.so
#0  0x009c4cbd in mca_btl_sm_add_procs_same_base_addr (btl=0x9c97c0,
nprocs=2, procs=0x8c28628, peers=0x8c28660,
    reachability=0xbfffde80) at btl_sm.c:412
412             mca_btl_sm_component.sm_ctl_header->segment_header.
(gdb) bt
#0  0x009c4cbd in mca_btl_sm_add_procs_same_base_addr (btl=0x9c97c0,
nprocs=2, procs=0x8c28628, peers=0x8c28660,
    reachability=0xbfffde80) at btl_sm.c:412
#1  0x005e7245 in mca_bml_r2_add_procs (nprocs=2, procs=0x8c28628,
bml_endpoints=0x8c28608, reachable=0xbfffde80) at bml_r2.c:220
#2  0x00323671 in mca_pml_ob1_add_procs (procs=0x8c285f8, nprocs=2) at
pml_ob1.c:131
#3  0x00ed6e81 in ompi_mpi_init (argc=0, argv=0x0, requested=0,
provided=0xbfffdf2c) at runtime/ompi_mpi_init.c:396
#4 0x00f00c62 in PMPI_Init (argc=0xbfffdf60, argv=0xbfffdf5c) at pinit.c:71
#5  0x00f2b23b in mpi_init_f (ierr=0x8052580) at pinit_f.c:65
#6  0x08049362 in MAIN__ () at Halo.f:19
#7  0x0804b7e6 in main ()
(gdb) up
#1  0x005e7245 in mca_bml_r2_add_procs (nprocs=2, procs=0x8c28628,
bml_endpoints=0x8c28608, reachable=0xbfffde80) at bml_r2.c:220
220             rc = btl->btl_add_procs(btl, n_new_procs, new_procs,
btl_endpoints, reachable);
(gdb) up
#2  0x00323671 in mca_pml_ob1_add_procs (procs=0x8c285f8, nprocs=2) at
pml_ob1.c:131
131         rc = mca_bml.bml_add_procs(
(gdb) up
#3  0x00ed6e81 in ompi_mpi_init (argc=0, argv=0x0, requested=0,
provided=0xbfffdf2c) at runtime/ompi_mpi_init.c:396
396         ret = MCA_PML_CALL(add_procs(procs, nprocs));
(gdb) up
#4 0x00f00c62 in PMPI_Init (argc=0xbfffdf60, argv=0xbfffdf5c) at pinit.c:71
71            err = ompi_mpi_init(*argc, *argv, required, &provided);
(gdb) up
#5  0x00f2b23b in mpi_init_f (ierr=0x8052580) at pinit_f.c:65
65          *ierr = OMPI_INT_2_FINT(MPI_Init( &argc, &argv ));
(gdb) up
#6  0x08049362 in MAIN__ () at Halo.f:19
19            CALL MPI_INIT(MPIERR)
Current language:  auto; currently fortran
(gdb)

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Jeff Squyres
Sent: Wednesday, November 09, 2005 10:41 PM
To: Open MPI Users
Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines

Sorry for the delay in replying -- it's a crazy week here preparing for
SC next week.

I'm double checking the code, and I don't see any obvious problems with
the btl tcp include stuff.

Can you also specify that you want OMPI's "out of band" communication
to use a specific network?

mpiexec -d --mca btl_tcp_if_include eth0 --mca oob_tcp_include eth0
-np 2 a.out

With the segv's, do you get meaningful core dumps?  Can you send
backtraces?



On Nov 8, 2005, at 3:02 PM, Marty Humphrey wrote:

It's taken me a while, but I've simplified the experiment...

In a nutshell, I'm seeing strange behavior in my multi-NIC box when I
attempt to execute " mpiexec -d --mca btl_tcp_if_include eth0  -np 2
a.out".
I have three different observed behaviors:

[1] mpi worker rank 0 displays the banner and then just hangs (apparently trying to exchange MPI messages, which don't get delivered)

2 PE'S AS A  2 BY  1 GRID

[2] it starts progressing (spitting out domain-specific msgs):

2 PE'S AS A  2 BY  1 GRID

  HALO2A  NPES,N =  2    2  TIME =  0.000007 SECONDS
  HALO2A  NPES,N =  2    4  TIME =  0.000007 SECONDS
  HALO2A  NPES,N =  2    8  TIME =  0.000007 SECONDS
  HALO2A  NPES,N =  2   16  TIME =  0.000008 SECONDS
  HALO2A  NPES,N =  2   32  TIME =  0.000009 SECONDS

[3] I get failure pretty quickly, with the line "mpiexec noticed that job rank 1 with PID 20425 on node "localhost" exited on signal 11."

Here's the output of "ifconfig":

[humphrey@zelda01 humphrey]$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:11:43:DC:EA:EE
          inet addr:130.207.252.131  Bcast:130.207.252.255
Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2441905 errors:0 dropped:0 overruns:0 frame:0
          TX packets:112786 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:197322445 (188.1 Mb)  TX bytes:32906750 (31.3 Mb)
          Base address:0xecc0 Memory:dfae0000-dfb00000

eth2      Link encap:Ethernet  HWaddr 00:11:95:C7:28:82
          inet addr:10.0.0.11  Bcast:10.0.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11598757 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7224590 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
RX bytes:3491651158 (3329.8 Mb) TX bytes:1916674000 (1827.8
Mb)
          Interrupt:77 Base address:0xcc00

ipsec0    Link encap:Ethernet  HWaddr 00:11:43:DC:EA:EE
          inet addr:130.207.252.131  Mask:255.255.255.0
          UP RUNNING NOARP  MTU:16260  Metric:1
          RX packets:40113 errors:0 dropped:40113 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:10
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:4742 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4742 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2369841 (2.2 Mb)  TX bytes:2369841 (2.2 Mb)

This is with openmpi-1.1a1r8038 .

Here is the output of a hanging invocation....

----- begin hanging invocation ----
[humphrey@zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0
-np 2
a.out
[zelda01.localdomain:20455] procdir: (null)
[zelda01.localdomain:20455] jobdir: (null)
[zelda01.localdomain:20455] unidir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
[zelda01.localdomain:20455] top:
openmpi-sessions-humphrey@zelda01.localdomain_0
[zelda01.localdomain:20455] tmp: /tmp
[zelda01.localdomain:20455] connect_uni: contact info read
[zelda01.localdomain:20455] connect_uni: connection not allowed
[zelda01.localdomain:20455] [0,0,0] setting up session dir with
[zelda01.localdomain:20455]     tmpdir /tmp
[zelda01.localdomain:20455]     universe default-universe-20455
[zelda01.localdomain:20455]     user humphrey
[zelda01.localdomain:20455]     host zelda01.localdomain
[zelda01.localdomain:20455]     jobid 0
[zelda01.localdomain:20455]     procid 0
[zelda01.localdomain:20455] procdir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
-20455/
0/0
[zelda01.localdomain:20455] jobdir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
-20455/
0
[zelda01.localdomain:20455] unidir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
-20455
[zelda01.localdomain:20455] top:
openmpi-sessions-humphrey@zelda01.localdomain_0
[zelda01.localdomain:20455] tmp: /tmp
[zelda01.localdomain:20455] [0,0,0] contact_file
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
-20455/
universe-setup.txt
[zelda01.localdomain:20455] [0,0,0] wrote setup file
[zelda01.localdomain:20455] pls:rsh: local csh: 0, local bash: 1
[zelda01.localdomain:20455] pls:rsh: assuming same remote shell as
local
shell
[zelda01.localdomain:20455] pls:rsh: remote csh: 0, remote bash: 1
[zelda01.localdomain:20455] pls:rsh: final template argv:
[zelda01.localdomain:20455] pls:rsh:     ssh <template> orted --debug
--bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
<template> --universe
humphrey@zelda01.localdomain:default-universe-20455
--nsreplica
"0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
130.207.252.1
31:35465" --gprreplica
"0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
130.207.252.1
31:35465" --mpi-call-yield 0
[zelda01.localdomain:20455] pls:rsh: launching on node localhost
[zelda01.localdomain:20455] pls:rsh: oversubscribed -- setting
mpi_yield_when_idle to 1 (1 2)
[zelda01.localdomain:20455] pls:rsh: localhost is a LOCAL node
[zelda01.localdomain:20455] pls:rsh: executing: orted --debug
--bootproxy 1
--name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost
--universe
humphrey@zelda01.localdomain:default-universe-20455 --nsreplica
"0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
130.207.252.1
31:35465" --gprreplica
"0.0.0;tcp://130.207.252.131:35465;tcp://10.0.0.11:35465;tcp://
130.207.252.1
31:35465" --mpi-call-yield 1
[zelda01.localdomain:20456] [0,0,1] setting up session dir with
[zelda01.localdomain:20456]     universe default-universe-20455
[zelda01.localdomain:20456]     user humphrey
[zelda01.localdomain:20456]     host localhost
[zelda01.localdomain:20456]     jobid 0
[zelda01.localdomain:20456]     procid 1
[zelda01.localdomain:20456] procdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/0/1
[zelda01.localdomain:20456] jobdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/0
[zelda01.localdomain:20456] unidir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455
[zelda01.localdomain:20456] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:20456] tmp: /tmp
[zelda01.localdomain:20457] [0,1,1] setting up session dir with
[zelda01.localdomain:20457]     universe default-universe-20455
[zelda01.localdomain:20457]     user humphrey
[zelda01.localdomain:20457]     host localhost
[zelda01.localdomain:20457]     jobid 1
[zelda01.localdomain:20457]     procid 1
[zelda01.localdomain:20457] procdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1/1
[zelda01.localdomain:20457] jobdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1
[zelda01.localdomain:20457] unidir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455
[zelda01.localdomain:20457] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:20457] tmp: /tmp
[zelda01.localdomain:20458] [0,1,0] setting up session dir with
[zelda01.localdomain:20458]     universe default-universe-20455
[zelda01.localdomain:20458]     user humphrey
[zelda01.localdomain:20458]     host localhost
[zelda01.localdomain:20458]     jobid 1
[zelda01.localdomain:20458]     procid 0
[zelda01.localdomain:20458] procdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1/0
[zelda01.localdomain:20458] jobdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455/1
[zelda01.localdomain:20458] unidir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20455
[zelda01.localdomain:20458] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:20458] tmp: /tmp
[zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1,
state =
0x3)
[zelda01.localdomain:20455] Info: Setting up debugger process table for
applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20457)
    (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20458)
[zelda01.localdomain:20455] spawn: in job_state_callback(jobid = 1,
state =
0x4)
[zelda01.localdomain:20458] [0,1,0] ompi_mpi_init completed
[zelda01.localdomain:20457] [0,1,1] ompi_mpi_init completed

 2 PE'S AS A  2 BY  1 GRID
------ end hanging invocation  -----

Here's the 1-in-approximately-20 that started working...

------- begin non-hanging invocation -----
[humphrey@zelda01 humphrey]$ mpiexec -d --mca btl_tcp_if_include eth0
-np 2
a.out
[zelda01.localdomain:20659] procdir: (null)
[zelda01.localdomain:20659] jobdir: (null)
[zelda01.localdomain:20659] unidir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
[zelda01.localdomain:20659] top:
openmpi-sessions-humphrey@zelda01.localdomain_0
[zelda01.localdomain:20659] tmp: /tmp
[zelda01.localdomain:20659] connect_uni: contact info read
[zelda01.localdomain:20659] connect_uni: connection not allowed
[zelda01.localdomain:20659] [0,0,0] setting up session dir with
[zelda01.localdomain:20659]     tmpdir /tmp
[zelda01.localdomain:20659]     universe default-universe-20659
[zelda01.localdomain:20659]     user humphrey
[zelda01.localdomain:20659]     host zelda01.localdomain
[zelda01.localdomain:20659]     jobid 0
[zelda01.localdomain:20659]     procid 0
[zelda01.localdomain:20659] procdir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
-20659/
0/0
[zelda01.localdomain:20659] jobdir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
-20659/
0
[zelda01.localdomain:20659] unidir:
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
-20659
[zelda01.localdomain:20659] top:
openmpi-sessions-humphrey@zelda01.localdomain_0
[zelda01.localdomain:20659] tmp: /tmp
[zelda01.localdomain:20659] [0,0,0] contact_file
/tmp/openmpi-sessions-humphrey@zelda01.localdomain_0/default-universe
-20659/
universe-setup.txt
[zelda01.localdomain:20659] [0,0,0] wrote setup file
[zelda01.localdomain:20659] pls:rsh: local csh: 0, local bash: 1
[zelda01.localdomain:20659] pls:rsh: assuming same remote shell as
local
shell
[zelda01.localdomain:20659] pls:rsh: remote csh: 0, remote bash: 1
[zelda01.localdomain:20659] pls:rsh: final template argv:
[zelda01.localdomain:20659] pls:rsh:     ssh <template> orted --debug
--bootproxy 1 --name <template> --num_procs 2 --vpid_start 0 --nodename
<template> --universe
humphrey@zelda01.localdomain:default-universe-20659
--nsreplica
"0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
130.207.252.1
31:35654" --gprreplica
"0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
130.207.252.1
31:35654" --mpi-call-yield 0
[zelda01.localdomain:20659] pls:rsh: launching on node localhost
[zelda01.localdomain:20659] pls:rsh: oversubscribed -- setting
mpi_yield_when_idle to 1 (1 2)
[zelda01.localdomain:20659] pls:rsh: localhost is a LOCAL node
[zelda01.localdomain:20659] pls:rsh: executing: orted --debug
--bootproxy 1
--name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost
--universe
humphrey@zelda01.localdomain:default-universe-20659 --nsreplica
"0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
130.207.252.1
31:35654" --gprreplica
"0.0.0;tcp://130.207.252.131:35654;tcp://10.0.0.11:35654;tcp://
130.207.252.1
31:35654" --mpi-call-yield 1
[zelda01.localdomain:20660] [0,0,1] setting up session dir with
[zelda01.localdomain:20660]     universe default-universe-20659
[zelda01.localdomain:20660]     user humphrey
[zelda01.localdomain:20660]     host localhost
[zelda01.localdomain:20660]     jobid 0
[zelda01.localdomain:20660]     procid 1
[zelda01.localdomain:20660] procdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/0/1
[zelda01.localdomain:20660] jobdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/0
[zelda01.localdomain:20660] unidir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659
[zelda01.localdomain:20660] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:20660] tmp: /tmp
[zelda01.localdomain:20661] [0,1,1] setting up session dir with
[zelda01.localdomain:20661]     universe default-universe-20659
[zelda01.localdomain:20661]     user humphrey
[zelda01.localdomain:20661]     host localhost
[zelda01.localdomain:20661]     jobid 1
[zelda01.localdomain:20661]     procid 1
[zelda01.localdomain:20661] procdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1/1
[zelda01.localdomain:20661] jobdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1
[zelda01.localdomain:20661] unidir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659
[zelda01.localdomain:20661] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:20661] tmp: /tmp
[zelda01.localdomain:20662] [0,1,0] setting up session dir with
[zelda01.localdomain:20662]     universe default-universe-20659
[zelda01.localdomain:20662]     user humphrey
[zelda01.localdomain:20662]     host localhost
[zelda01.localdomain:20662]     jobid 1
[zelda01.localdomain:20662]     procid 0
[zelda01.localdomain:20662] procdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1/0
[zelda01.localdomain:20662] jobdir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659/1
[zelda01.localdomain:20662] unidir:
/tmp/openmpi-sessions-humphrey@localhost_0/default-universe-20659
[zelda01.localdomain:20662] top: openmpi-sessions-humphrey@localhost_0
[zelda01.localdomain:20662] tmp: /tmp
[zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
state =
0x3)
[zelda01.localdomain:20659] Info: Setting up debugger process table for
applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, localhost, /home/humphrey/a.out, 20661)
    (i, host, exe, pid) = (1, localhost, /home/humphrey/a.out, 20662)
[zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
state =
0x4)
[zelda01.localdomain:20662] [0,1,0] ompi_mpi_init completed
[zelda01.localdomain:20661] [0,1,1] ompi_mpi_init completed

 2 PE'S AS A  2 BY  1 GRID

  HALO2A  NPES,N =  2    2  TIME =  0.000007 SECONDS
  HALO2A  NPES,N =  2    4  TIME =  0.000007 SECONDS
  HALO2A  NPES,N =  2    8  TIME =  0.000007 SECONDS
  HALO2A  NPES,N =  2   16  TIME =  0.000008 SECONDS
  HALO2A  NPES,N =  2   32  TIME =  0.000009 SECONDS
  HALO2A  NPES,N =  2   64  TIME =  0.000011 SECONDS
mpiexec: killing job...
Interrupt
Interrupt
[zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
empty
- deleting
[zelda01.localdomain:20660] sess_dir_finalize: job session dir not
empty -
leaving
[zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
empty
- deleting
[zelda01.localdomain:20660] sess_dir_finalize: found job session dir
empty -
deleting
[zelda01.localdomain:20660] sess_dir_finalize: univ session dir not
empty -
leaving
[zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
state =
0xa)
[zelda01.localdomain:20660] orted: job_state_callback(jobid = 1, state
=
ORTE_PROC_STATE_ABORTED)
[zelda01.localdomain:20659] spawn: in job_state_callback(jobid = 1,
state =
0x9)
2 processes killed (possibly by Open MPI)
[zelda01.localdomain:20660] orted: job_state_callback(jobid = 1, state
=
ORTE_PROC_STATE_TERMINATED)
[zelda01.localdomain:20660] sess_dir_finalize: found proc session dir
empty
- deleting
[zelda01.localdomain:20660] sess_dir_finalize: found job session dir
empty -
deleting
[zelda01.localdomain:20660] sess_dir_finalize: found univ session dir
empty
- deleting
[zelda01.localdomain:20660] sess_dir_finalize: found top session dir
empty -
deleting
[zelda01.localdomain:20659] sess_dir_finalize: found proc session dir
empty
- deleting
[zelda01.localdomain:20659] sess_dir_finalize: found job session dir
empty -
deleting
[zelda01.localdomain:20659] sess_dir_finalize: found univ session dir
empty
- deleting
[zelda01.localdomain:20659] sess_dir_finalize: top session dir not
empty -
leaving
[humphrey@zelda01 humphrey]$
-------- end non-hanging invocation ------

Any thoughts?

-- Marty

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On
Behalf Of Jeff Squyres
Sent: Tuesday, November 01, 2005 2:17 PM
To: Open MPI Users
Subject: Re: [O-MPI users] can't get openmpi to run across two multi-NIC machines

On Nov 1, 2005, at 12:02 PM, Marty Humphrey wrote:

wukong: eth0 (152.48.249.102, no MPI traffic), eth1 (128.109.34.20, yes MPI traffic)
zelda01: eth0 (130.207.252.131, yes MPI traffic), eth2 (10.0.0.12, no MPI traffic)

on wukong, I have :
[humphrey@wukong ~]$ more ~/.openmpi/mca-params.conf
btl_tcp_if_include=eth1
on zelda01, I have :
[humphrey@zelda01 humphrey]$ more ~/.openmpi/mca-params.conf
btl_tcp_if_include=eth0

Just to make sure I'm reading this right -- 128.109.34.20 is supposed
to be routable to 130.207.252.131, right?  Can you ssh directly from
one machine to the other? (I'm guessing that you can because OMPI was
able to start processes)  Can you ping one machine from the other?

Most importantly -- can you open arbitrary TCP ports between the two
machines?  (i.e., not just well-known ports like 22 [ssh], etc.)
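
For what it's worth, here is a minimal standalone test (not part of Open MPI; the host and port defaults are placeholders) that simply tries to connect() to an arbitrary TCP port on the other machine. Start any listener on the remote side first (e.g., "nc -l -p 5000" with traditional netcat); if ssh works but this fails, a firewall is probably blocking the dynamic ports that OMPI wants to use:

/* tcp_probe.c -- quick check that an arbitrary TCP port is reachable */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    const char *host = (argc > 1) ? argv[1] : "128.109.34.20"; /* placeholder */
    const char *port = (argc > 2) ? argv[2] : "5000";          /* placeholder */
    struct addrinfo hints, *res;
    int rc, fd;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    rc = getaddrinfo(host, port, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }
    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");  /* a blocked or unreachable port shows up here */
        freeaddrinfo(res);
        return 1;
    }
    printf("connected to %s:%s -- port is reachable\n", host, port);
    close(fd);
    freeaddrinfo(res);
    return 0;
}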

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/



--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/



--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
