On Thu, Jul 07, 2011 at 08:38:56PM -0400, Jeff Squyres wrote:
> On Jul 5, 2011, at 4:24 PM, Steve Kargl wrote:
> > On Tue, Jul 05, 2011 at 01:14:06PM -0700, Steve Kargl wrote:
> >> I have an application that appears to function as I expect
> >> when compiled with openmpi-1.4.2 on FreeBSD 9.0. But, it
> >> appears to hang during communication between nodes. What
> >> follows is the long version.
> >
> > Argh I messed up. It should read "But, it appears to hang
> > during communication between nodes when using 1.4.3 or 1.4.4."
>
> Are you able to run simple MPI applications with 1.4.3 or 1.4.4
> on your OS? E.g., the "ring_c" program in the example/ directory?
> This might be a good test to see if OMPI's TCP is working at all.
>
> Assuming that works... Have you tried attaching debuggers to see
> where your process is hanging? There might be a logic issue in
> your app that isn't-quite-legal-MPI but works with some amount
> of buffering, but fails if the amount of buffering is reduced.
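I tried both suggestions.  The ring program runs fine under 1.4.4 as
long as node10 is kept out of the machine file, and attaching gdb to a
hung process shows it spinning in opal_progress under PMPI_Send.
Details below.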
It seems that code compiled with openmpi-1.4.4 is trying to use the
wrong NIC.  My /etc/hosts file has

10.208.78.111   hpc.apl.washington.edu hpc
192.168.0.10    node10.cimu.org node10 n10 master
192.168.0.11    node11.cimu.org node11 n11
192.168.0.12    node12.cimu.org node12 n12
... down to ...
192.168.0.21    node21.cimu.org node21 n21

Note that node10 and hpc are the same system (two different NICs).

hpc:kargl[252] /usr/local/openmpi-1.4.4/bin/mpif90 -o z -g -O ring_f90.f90
hpc:kargl[253] cat > mf1
node10 slots=1
node11 slots=1
node12 slots=1
hpc:kargl[254] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf1 ./z
Process 0 sending 10 to 1 tag 201 ( 3 processes in ring)

The run hangs here.  If, in another xterm, I attach gdb to the process
on node10, I see:

(gdb) bt
#0  0x00000003c10f9b9c in kevent () from /lib/libc.so.7
#1  0x000000000052ca18 in kq_dispatch ()
#2  0x000000000052ba93 in opal_event_base_loop ()
#3  0x000000000052549b in opal_progress ()
#4  0x000000000048fcfc in mca_pml_ob1_send ()
#5  0x0000000000428873 in PMPI_Send ()
#6  0x000000000041a890 in pmpi_send__ ()
#7  0x000000000041a3f0 in ring () at ring_f90.f90:34
#8  0x000000000041a640 in main (argc=<value optimized out>,
    argv=<value optimized out>) at ring_f90.f90:10
#9  0x000000000041a1cc in _start ()
(gdb) quit

Now, eliminating node10 from the machine file, I see:

hpc:kargl[255] cat > mf2
node11 slots=1
node12 slots=1
node13 slots=1
hpc:kargl[256] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf2 ./z
Process 0 sending 10 to 1 tag 201 ( 3 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting

I also have a simple MPI test program, netmpi.c, from Argonne.  It shows:

hpc:kargl[263] /usr/local/openmpi-1.4.4/bin/mpicc -o z -g -O GetOpt.c netmpi.c
hpc:kargl[264] cat mf_ompi_3
node11.cimu.org slots=1
node16.cimu.org slots=1
hpc:kargl[265] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf_ompi_3 ./z
1: node16.cimu.org
0: node11.cimu.org
Latency: 0.000073617
Sync Time: 0.000147234
Now starting main loop
  0: 0 bytes 16384 times --> 0.00 Mbps in 0.000073612 sec
  1: 1 bytes 16384 times --> 0.10 Mbps in 0.000073612 sec
  2: 2 bytes  3396 times --> 0.21 Mbps in 0.000073611 sec
  3: 3 bytes  1698 times --> 0.31 Mbps in 0.000073609 sec
  4: 5 bytes  2264 times --> 0.52 Mbps in 0.000073610 sec
  5: 7 bytes  1358 times --> 0.73 Mbps in 0.000073608 sec

Replacing node11 with node10 (the same system as hpc), the run hangs
right after printing the hostnames:

hpc:kargl[268] cat mf_ompi_1
node10.cimu.org slots=1
node16.cimu.org slots=1
hpc:kargl[267] /usr/local/openmpi-1.4.4/bin/mpiexec -machinefile mf_ompi_1 ./z
0: hpc.apl.washington.edu
1: node16.cimu.org

Attaching gdb to the hung process again shows:

(gdb) bt
#0  0x00000003c0bedb9c in kevent () from /lib/libc.so.7
#1  0x000000000052d648 in kq_dispatch ()
#2  0x000000000052c6c3 in opal_event_base_loop ()
#3  0x00000000005260cb in opal_progress ()
#4  0x0000000000491d1c in mca_pml_ob1_send ()
#5  0x000000000043c753 in PMPI_Send ()
#6  0x000000000041a112 in Sync (p=0x7fffffffd4d0) at netmpi.c:573
#7  0x000000000041a3cf in DetermineLatencyReps (p=0x3) at netmpi.c:593
#8  0x000000000041a4fe in TestLatency (p=0x3) at netmpi.c:630
#9  0x000000000041a958 in main (argc=1, argv=0x7fffffffd6a0) at netmpi.c:213
(gdb) quit

Why is hpc.apl.washington.edu appearing instead of node10.cimu.org?

--
Steve
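P.S.  One knob I have not yet tried: pinning the TCP BTL to the private
network so that node10's 10.208.78.111 interface is never considered.
Something along these lines, where "em0" is only a placeholder for
whatever device actually carries 192.168.0.10 on these nodes:

/usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_tcp_if_include em0 \
    -machinefile mf_ompi_1 ./z

ompi_info --param btl tcp lists the TCP BTL parameters (including
btl_tcp_if_include and btl_tcp_if_exclude) if that is not the right
knob.  If pinning the interface makes the hang disappear, that would
point at interface selection rather than the application.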