On Fri, Jul 08, 2011 at 02:19:27PM -0400, Jeff Squyres wrote:
> On Jul 8, 2011, at 1:31 PM, Steve Kargl wrote:
>
> > It seems that openmpi-1.4.4 compiled code is trying to use the
> > wrong nic. My /etc/hosts file has
> >
> > 10.208.78.111 hpc.apl.washington.edu hpc
> > 192.168.0.10 node10.cimu.org node10 n10 master
> > 192.168.0.11 node11.cimu.org node11 n11
> > 192.168.0.12 node12.cimu.org node12 n12
> > ... down to ...
> > 192.168.0.21 node21.cimu.org node21 n21
> >
> > Note, node10 and hpc are the same system (2 different NICs).
>
> Don't confuse the machinefile with the NICs that OMPI will try
> to use. The machinefile is only hosts on which OMPI will launch.
> Specifically: the machinefile does not influence which NICs OMPI
> will use for MPI communications.

Ah, okay. I did not realize that a machinefile does not limit
OMPI to a set of IP addresses.

> > hpc:kargl[268] cat mf_ompi_1
> > node10.cimu.org slots=1
> > node16.cimu.org slots=1
> > hpc:kargl[267] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf_ompi_1
> > ./z
> > 0: hpc.apl.washington.edu
> > 1: node16.cimu.org
>
> What function is netmpi.c using to get the hostname that
> is printed? It might be using MPI_Get_processor_name()
> or gethostname() -- both of which may return whatever hostname(1) returns.

After reading the code, I see that the printed hostname misled me.
The code uses MPI_Get_processor_name().

> > (gdb) bt
> > #0 0x00000003c0bedb9c in kevent () from /lib/libc.so.7
> > #1 0x000000000052d648 in kq_dispatch ()
> > #2 0x000000000052c6c3 in opal_event_base_loop ()
> > #3 0x00000000005260cb in opal_progress ()
> > #4 0x0000000000491d1c in mca_pml_ob1_send ()
> > #5 0x000000000043c753 in PMPI_Send ()
> > #6 0x000000000041a112 in Sync (p=0x7fffffffd4d0) at netmpi.c:573
> > #7 0x000000000041a3cf in DetermineLatencyReps (p=0x3) at netmpi.c:593
> > #8 0x000000000041a4fe in TestLatency (p=0x3) at netmpi.c:630
> > #9 0x000000000041a958 in main (argc=1, argv=0x7fffffffd6a0) at netmpi.c:213
> > (gdb) quit
>
> The easiest way to fix this is likely to use the btl_tcp_if_include
> or btl_tcp_if_exclude MCA parameters -- i.e., tell OMPI exactly which
> interfaces to use:
>
> http://www.open-mpi.org/faq/?category=tcp#tcp-selection

Thanks for the pointer. I'll try this solution later.

> Hypothetically, however, OMPI should be able to determine that
> 192.168.0.x is not reachable from the 10.x network (assuming
> your netmasks are set right), and automatically not use the
> 10.x network to reach any of the non-node10 machines.

The assumption is correct. 192.x is independent of 10.x.

> It's curious that this is not happening; I wonder if this
> is some kind of quirk of OMPI's reachability algorithms
> (http://www.open-mpi.org/faq/?category=tcp#tcp-routability)
> on FreeBSD...?
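
The netmask test Jeff describes amounts to a masked comparison of
two addresses. Below is a minimal sketch of that comparison in C. It is
not OMPI's actual reachability code: same_subnet() is a hypothetical
helper name, and the 255.255.255.0 netmask used in the note after the
code is an assumed value for illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper (not part of Open MPI): compare two dotted-quad
 * IPv4 addresses under a dotted-quad netmask.
 *
 * Returns 1 if both addresses fall in the same subnet, 0 if they do
 * not, and -1 if any of the three strings fails to parse.
 */
int same_subnet(const char *a, const char *b, const char *mask)
{
    struct in_addr ia, ib, im;

    /* inet_pton() returns 1 only on a successful conversion. */
    if (inet_pton(AF_INET, a, &ia) != 1 ||
        inet_pton(AF_INET, b, &ib) != 1 ||
        inet_pton(AF_INET, mask, &im) != 1)
        return -1;

    /*
     * All three values are in network byte order, so the masked
     * comparison is valid without any byte swapping.
     */
    return (ia.s_addr & im.s_addr) == (ib.s_addr & im.s_addr);
}
```

With the addresses from the /etc/hosts excerpt and the assumed netmask,
same_subnet("192.168.0.10", "192.168.0.16", "255.255.255.0") returns 1,
while same_subnet("10.208.78.111", "192.168.0.16", "255.255.255.0")
returns 0. The second result is the one that should rule out the 10.x
NIC for reaching node16.
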

I just rebuilt 1.4.4rc2 with '-O -g' to get debugging symbols
into openmpi's libraries and executables. Are there any particular
functions that I should inspect?

-- 
Steve