Re: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD

Steve Kargl Fri, 8 Jul 2011 14:48:46 -0400

On Fri, Jul 08, 2011 at 02:19:27PM -0400, Jeff Squyres wrote:
> On Jul 8, 2011, at 1:31 PM, Steve Kargl wrote:
> 
> > It seems that openmpi-1.4.4 compiled code is trying to use the
> > wrong nic.  My /etc/hosts file has
> > 
> > 10.208.78.111           hpc.apl.washington.edu hpc
> > 192.168.0.10            node10.cimu.org node10 n10 master
> > 192.168.0.11            node11.cimu.org node11 n11
> > 192.168.0.12            node12.cimu.org node12 n12
> > ... down to ...
> > 192.168.0.21            node21.cimu.org node21 n21
> > 
> > Note, node10 and hpc are the same system (2 different NICs). 
> 
> Don't confuse the machinefile with the NICs that OMPI will try
> to use.  The machinefile is only hosts on which OMPI will launch.
> Specifically: the machinefile does not influence which NICs OMPI
> will use for MPI communications.


Ah, okay.  I did not realize that a machinefile does not
limit OMPI to a set of IP addresses.

> > hpc:kargl[268] cat mf_ompi_1
> > node10.cimu.org slots=1
> > node16.cimu.org slots=1
> > hpc:kargl[267] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf_ompi_1 ./z
> > 0: hpc.apl.washington.edu
> > 1: node16.cimu.org
> 
> What function is netmpi.c using to get the hostname that
> is printed?  It might be using MPI_Get_processor_name()
> or gethostname() -- both of which may return whatever hostname(1) returns.  

After reading the code, it appears that this output misled me.  The
code uses MPI_Get_processor_name().
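
For illustration, here is a minimal sketch of why the printed name says
nothing about which NIC carries the traffic.  It uses plain gethostname()
rather than MPI_Get_processor_name(), so it builds without MPI; the helper
name system_hostname is hypothetical and is not taken from netmpi.c.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: copy the system hostname into buf.
 * Returns 0 on success, -1 on failure.  The result is the name
 * configured for the host as a whole; it does not depend on which
 * network interface is later used for communication. */
int system_hostname(char *buf, size_t len)
{
    if (len == 0 || gethostname(buf, len) != 0)
        return -1;
    /* gethostname() may leave the buffer unterminated if truncated. */
    buf[len - 1] = '\0';
    return 0;
}
```

On a host with two NICs, such as hpc/node10 above, a call like this would
presumably return the single configured hostname regardless of whether the
process was launched as node10.cimu.org.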

> > (gdb) bt
> > #0  0x00000003c0bedb9c in kevent () from /lib/libc.so.7
> > #1  0x000000000052d648 in kq_dispatch ()
> > #2  0x000000000052c6c3 in opal_event_base_loop ()
> > #3  0x00000000005260cb in opal_progress ()
> > #4  0x0000000000491d1c in mca_pml_ob1_send ()
> > #5  0x000000000043c753 in PMPI_Send ()
> > #6  0x000000000041a112 in Sync (p=0x7fffffffd4d0) at netmpi.c:573
> > #7  0x000000000041a3cf in DetermineLatencyReps (p=0x3) at netmpi.c:593
> > #8  0x000000000041a4fe in TestLatency (p=0x3) at netmpi.c:630
> > #9  0x000000000041a958 in main (argc=1, argv=0x7fffffffd6a0) at netmpi.c:213
> > (gdb) quit
> 
> The easiest way to fix this is likely to use the btl_tcp_if_include
> or btl_tcp_if_exclude MCA parameters -- i.e., tell OMPI exactly which
> interfaces to use:
> 
>     http://www.open-mpi.org/faq/?category=tcp#tcp-selection

Thanks for the pointer.  I'll try this solution later.

> Hypothetically, however, OMPI should be able to determine that
> 192.168.0.x is not reachable from the 10.x network (assuming
> your netmasks are set right), and automatically not use the
> 10.x network to reach any of the non-node10 machines.

The assumption is correct.  192.x is independent of 10.x.
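
The kind of netmask comparison described above can be sketched as follows.
This is only an illustration of the idea, not OMPI's actual reachability
code; the helper name same_subnet and the 255.255.255.0 netmask are
assumptions, while the addresses come from the /etc/hosts entries quoted
earlier.

```c
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: returns 1 if two IPv4 addresses fall in the
 * same subnet under the given netmask, 0 if they do not, and -1 if
 * any of the three strings fails to parse. */
int same_subnet(const char *a, const char *b, const char *mask)
{
    struct in_addr ia, ib, im;

    if (inet_pton(AF_INET, a, &ia) != 1 ||
        inet_pton(AF_INET, b, &ib) != 1 ||
        inet_pton(AF_INET, mask, &im) != 1)
        return -1;

    /* Compare only the network portion of each address. */
    return (ia.s_addr & im.s_addr) == (ib.s_addr & im.s_addr);
}
```

With an assumed /24 netmask, same_subnet("192.168.0.10", "192.168.0.16",
"255.255.255.0") returns 1, while same_subnet("10.208.78.111",
"192.168.0.16", "255.255.255.0") returns 0, which is the case where the
10.x interface should not be chosen to reach node16.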

> It's curious that this is not happening; I wonder if this
> is some kind of quirk of OMPI's reachability algorithms
> (http://www.open-mpi.org/faq/?category=tcp#tcp-routability)
>  on FreeBSD...?

I just rebuilt 1.4.4rc2 with '-O -g' to get debugging symbols
into openmpi's libraries and executables.  Are there any
particular functions that I should inspect?

-- 
Steve
