You might also want to check that you don't have any firewalls between those nodes. This is a typical cause of the behavior you describe.
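A quick way to rule that out is sketched below. This is only an illustration: the port numbers are arbitrary, and the MCA parameter names should be confirmed against your own build with "ompi_info --param btl tcp".

    # Pin the TCP BTL to a known port range so there is something concrete to test:
    #   mpirun --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 ...

    # Check that a port in that range is reachable between two of the nodes,
    # e.g. with netcat, listening on 10.0.0.31 and connecting from 10.0.0.21:
    #   (on 10.0.0.31)   nc -l 10050
    #   (on 10.0.0.21)   nc -v 10.0.0.31 10050

    # On OS X 10.6 the packet filter is ipfw; list the active rules on each node:
    #   sudo ipfw list

If the netcat test stalls the same way ring_c does, the problem is in the network or a firewall, not in Open MPI.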
On Jul 4, 2013, at 4:25 PM, Gustavo Correa <g...@ldeo.columbia.edu> wrote:

> Hi Jed
>
> You could try to select only the ethernet interface that matches your nodes' IP
> addresses, which seems to be en2.
>
> The en1 interface seems to carry an external IP.
> Not sure about en3, but it is odd that it has a different IP than en2
> while sitting in the same subnet.
> I wonder if this may be the reason for the program hanging.
>
> You may need to search all nodes' ifconfig output for a consistent set of
> interfaces/IP addresses, and tailor your mpiexec command line and your
> hostfile accordingly.
>
> Say, something like this:
>
> mpiexec -mca btl_tcp_if_include en2 -hostfile your_hostfile -np 43 ./ring_c
>
> See this FAQ (actually, all of them are very informative):
> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>
> I hope this helps,
> Gus Correa
>
>
> On Jul 4, 2013, at 6:37 PM, Jed O. Kaplan wrote:
>
>> Dear Open MPI gurus,
>>
>> I am running Open MPI 1.7.2 on a homogeneous cluster of Apple Xserves
>> running OS X 10.6.8. My hardware nodes are connected through four
>> gigabit ethernet connections; I have no InfiniBand or other high-speed
>> interconnect. The problem I describe below is the same if I use Open MPI
>> 1.6.5. My Open MPI installation is compiled with Intel icc and ifort. See
>> the attached output of ompi_info --all for more details on my
>> installation and runtime parameters, and other diagnostic information
>> below.
>>
>> My problem is that communication between hardware nodes hangs in
>> one of my own programs. I thought this was the fault of my own bad
>> programming, so I tried some of the example programs that are
>> distributed with the Open MPI source code. The "ring_*" examples, in
>> whichever of the APIs (C, C++, Fortran, etc.), show the same faulty
>> behavior that I noticed in my own program: if I run the program on a
>> single hardware node (with multiple processes) it works fine. As soon as
>> I run the program across hardware nodes, it hangs. Below you will find
>> an example of the program output and other diagnostic information.
>>
>> This problem has really frustrated me. Unfortunately I am not
>> experienced enough with Open MPI to get further into the debugging.
>>
>> Thank you in advance for any help you can give me!
>>
>> Jed Kaplan
>>
>> --- DETAILS OF MY PROBLEM ---
>>
>> -- this run works because it is only on one hardware node --
>>
>> jkaplan@grkapsrv2:~/openmpi_examples > mpirun --prefix /usr/local --hostfile arvehosts.txt -np 3 ring_c
>> Process 0 sending 10 to 1, tag 201 (3 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> Process 0 decremented value: 7
>> Process 0 decremented value: 6
>> Process 0 decremented value: 5
>> Process 0 decremented value: 4
>> Process 0 decremented value: 3
>> Process 0 decremented value: 2
>> Process 0 decremented value: 1
>> Process 0 decremented value: 0
>> Process 0 exiting
>> Process 1 exiting
>> Process 2 exiting
>>
>> -- this run hangs when running over two hardware nodes --
>>
>> jkaplan@grkapsrv2:~/openmpi_examples > mpirun --prefix /usr/local --hostfile arvehosts.txt -np 4 ring_c
>> Process 0 sending 10 to 1, tag 201 (4 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> ... hangs forever ...
>> ^CKilled by signal 2.
>>
>> -- here is what my hostfile looks like --
>>
>> jkaplan@grkapsrv2:~/openmpi_examples > cat arvehosts.txt
>> #host file for ARVE group mac servers
>>
>> 10.0.0.21 slots=3
>> 10.0.0.31 slots=8
>> 10.0.0.41 slots=8
>> 10.0.0.51 slots=8
>> 10.0.0.61 slots=8
>> 10.0.0.71 slots=8
>>
>> -- results of ifconfig - this looks pretty much the same on all of my
>> servers, with different IP addresses of course --
>>
>> jkaplan@grkapsrv2:~/openmpi_examples > ifconfig
>> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
>>      inet6 ::1 prefixlen 128
>>      inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
>>      inet 127.0.0.1 netmask 0xff000000
>> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
>> stf0: flags=0<> mtu 1280
>> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>      ether 00:24:36:f3:dc:fc
>>      inet6 fe80::224:36ff:fef3:dcfc%en0 prefixlen 64 scopeid 0x4
>>      inet 128.178.107.85 netmask 0xffffff00 broadcast 128.178.107.255
>>      media: autoselect (1000baseT <full-duplex>)
>>      status: active
>> en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>      ether 00:24:36:f3:dc:fa
>>      inet6 fe80::224:36ff:fef3:dcfa%en1 prefixlen 64 scopeid 0x5
>>      inet 10.0.0.2 netmask 0xff000000 broadcast 10.255.255.255
>>      media: autoselect (1000baseT <full-duplex,flow-control>)
>>      status: active
>> en2: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>      ether 00:24:36:f5:ba:4e
>>      inet6 fe80::224:36ff:fef5:ba4e%en2 prefixlen 64 scopeid 0x6
>>      inet 10.0.0.21 netmask 0xff000000 broadcast 10.255.255.255
>>      media: autoselect (1000baseT <full-duplex,flow-control>)
>>      status: active
>> en3: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>      ether 00:24:36:f5:ba:4f
>>      inet6 fe80::224:36ff:fef5:ba4f%en3 prefixlen 64 scopeid 0x7
>>      inet 10.0.0.22 netmask 0xff000000 broadcast 10.255.255.255
>>      media: autoselect (1000baseT <full-duplex,flow-control>)
>>      status: active
>> fw0: flags=8822<BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 4078
>>      lladdr 04:1e:64:ff:fe:f8:aa:d2
>>      media: autoselect <full-duplex>
>>      status: inactive
>>
>> <ompi_info.txt>
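Putting Gus's interface-selection suggestion together with the hostfile above, a minimal sketch of a consistent setup follows. It assumes every node keeps its 10.0.0.x1 MPI address on en2, as grkapsrv2 does; the interface name and the MCA parameter names should be verified against ompi_info for your 1.7.2 installation.

    # Keep both the MPI traffic (btl_tcp) and the runtime's out-of-band channel
    # (oob_tcp) on the private en2 network, leaving out the public en0, the
    # extra en1, and the duplicate-subnet en3:
    mpirun --prefix /usr/local \
           --hostfile arvehosts.txt \
           --mca btl_tcp_if_include en2 \
           --mca oob_tcp_if_include en2 \
           -np 4 ring_c

If that still hangs, adding --mca btl_base_verbose 30 to the same command will show which interfaces and peers the TCP BTL is actually trying to use.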