You might also want to check that you don't have any firewalls between those 
nodes. That is a typical cause of what you describe.
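
If it helps, here is one quick way to check on OS X 10.6 (just a sketch;
port 12345 below is an arbitrary test port, not anything Open MPI requires):

  # list any active ipfw rules on each node; an empty list (or only the
  # default "allow ip from any to any" rule) means no packet filter is in
  # the way
  sudo ipfw list

  # test an arbitrary TCP connection between two of the nodes:
  # on 10.0.0.31:
  nc -l 12345
  # on 10.0.0.21:
  echo hello | nc 10.0.0.31 12345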


On Jul 4, 2013, at 4:25 PM, Gustavo Correa <g...@ldeo.columbia.edu> wrote:

> Hi Jed 
> 
> You could try to select only the Ethernet interface that matches your
> node's IP address in the hostfile, which seems to be en2.
> 
> The en0 interface carries an external IP. 
> I am not sure about en1 and en3, but it is odd that each of them has a 
> different IP than en2 while sitting in the same subnet.
> I wonder if this may be the reason the program hangs.
> 
> You may need to check ifconfig on all nodes for a consistent set of 
> interfaces/IP addresses, and tailor your mpiexec command line and your 
> hostfile accordingly.
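> 
> A quick way to compare interfaces across nodes, assuming passwordless ssh 
> between them (just a sketch, using the IPs from your hostfile), might be:
> 
>   for h in 10.0.0.21 10.0.0.31 10.0.0.41 10.0.0.51 10.0.0.61 10.0.0.71; do
>     echo "== $h =="
>     # show only the private 10.x addresses configured on each node
>     ssh $h "/sbin/ifconfig | grep 'inet 10\.'"
>   done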
> 
> For the mpiexec command line, say, something like this:
> 
> mpiexec -mca btl_tcp_if_include en2 -hostfile your_hostfile -np 43 ./ring_c
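> 
> Or, if it is easier to exclude the interfaces you don't want than to 
> include the one you do (just a sketch; adjust the interface names to 
> whatever your nodes actually have):
> 
> mpiexec -mca btl_tcp_if_exclude lo0,en0,en3 -hostfile your_hostfile -np 43 ./ring_c
> 
> To make the choice permanent, the same setting can go in 
> $HOME/.openmpi/mca-params.conf on every node, e.g.:
> 
> btl_tcp_if_include = en2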
> 
> See this FAQ (actually, all of them are very informative):
> http://www.open-mpi.org/faq/?category=tcp#tcp-selection
> 
> I hope this helps,
> Gus Correa
> 
> 
> 
> On Jul 4, 2013, at 6:37 PM, Jed O. Kaplan wrote:
> 
>> Dear openmpi gurus,
>> 
>> I am running Open MPI 1.7.2 on a homogeneous cluster of Apple Xserves
>> running OS X 10.6.8. My hardware nodes are connected through four
>> gigabit Ethernet connections; I have no InfiniBand or other high-speed
>> interconnect. The problem I describe below is the same if I use Open MPI
>> 1.6.5. My Open MPI installation is compiled with Intel icc and ifort.
>> See the attached output of ompi_info --all for more details on my
>> installation and runtime parameters, and other diagnostic information
>> below.
>> 
>> My problem is that inter-node communication hangs in one of my own
>> programs. I thought this was the fault of my own bad programming, so I
>> tried some of the example programs distributed with the Open MPI source
>> code. With the "ring_*" examples, whichever API I use (C, C++, Fortran,
>> etc.), I see the same faulty behavior I noticed in my own program: if I
>> run the program on a single hardware node (with multiple processes) it
>> works fine, but as soon as I run it across hardware nodes, it hangs.
>> Below you will find an example of the program output and other
>> diagnostic information.
>> 
>> This problem has really frustrated me. Unfortunately, I am not
>> experienced enough with Open MPI to take the debugging much further.
>> 
>> Thank you in advance for any help you can give me!
>> 
>> Jed Kaplan
>> 
>> --- DETAILS OF MY PROBLEM ---
>> 
>> -- this run works because it is only on one hardware node --
>> 
>> jkaplan@grkapsrv2:~/openmpi_examples >  mpirun --prefix /usr/local
>> --hostfile arvehosts.txt -np 3 ring_c
>> Process 0 sending 10 to 1, tag 201 (3 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> Process 0 decremented value: 7
>> Process 0 decremented value: 6
>> Process 0 decremented value: 5
>> Process 0 decremented value: 4
>> Process 0 decremented value: 3
>> Process 0 decremented value: 2
>> Process 0 decremented value: 1
>> Process 0 decremented value: 0
>> Process 0 exiting
>> Process 1 exiting
>> Process 2 exiting
>> 
>> -- this run hangs when running over two hardware nodes --
>> 
>> jkaplan@grkapsrv2:~/openmpi_examples >  mpirun --prefix /usr/local
>> --hostfile arvehosts.txt -np 4 ring_c
>> Process 0 sending 10 to 1, tag 201 (4 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> ... hangs forever ...
>> ^CKilled by signal 2.
>> 
>> -- here is what my hostfile looks like --
>> 
>> jkaplan@grkapsrv2:~/openmpi_examples > cat arvehosts.txt 
>> #host file for ARVE group mac servers
>> 
>> 10.0.0.21 slots=3
>> 10.0.0.31 slots=8
>> 10.0.0.41 slots=8
>> 10.0.0.51 slots=8
>> 10.0.0.61 slots=8 
>> 10.0.0.71 slots=8
>> 
>> -- results of ifconfig - this looks pretty much the same on all of my
>> servers, with different IP addresses of course --
>> 
>> jkaplan@grkapsrv2:~/openmpi_examples > ifconfig
>> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
>>      inet6 ::1 prefixlen 128 
>>      inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 
>>      inet 127.0.0.1 netmask 0xff000000 
>> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
>> stf0: flags=0<> mtu 1280
>> en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>      ether 00:24:36:f3:dc:fc 
>>      inet6 fe80::224:36ff:fef3:dcfc%en0 prefixlen 64 scopeid 0x4 
>>      inet 128.178.107.85 netmask 0xffffff00 broadcast 128.178.107.255
>>      media: autoselect (1000baseT <full-duplex>)
>>      status: active
>> en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>      ether 00:24:36:f3:dc:fa 
>>      inet6 fe80::224:36ff:fef3:dcfa%en1 prefixlen 64 scopeid 0x5 
>>      inet 10.0.0.2 netmask 0xff000000 broadcast 10.255.255.255
>>      media: autoselect (1000baseT <full-duplex,flow-control>)
>>      status: active
>> en2: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>      ether 00:24:36:f5:ba:4e 
>>      inet6 fe80::224:36ff:fef5:ba4e%en2 prefixlen 64 scopeid 0x6 
>>      inet 10.0.0.21 netmask 0xff000000 broadcast 10.255.255.255
>>      media: autoselect (1000baseT <full-duplex,flow-control>)
>>      status: active
>> en3: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>>      ether 00:24:36:f5:ba:4f 
>>      inet6 fe80::224:36ff:fef5:ba4f%en3 prefixlen 64 scopeid 0x7 
>>      inet 10.0.0.22 netmask 0xff000000 broadcast 10.255.255.255
>>      media: autoselect (1000baseT <full-duplex,flow-control>)
>>      status: active
>> fw0: flags=8822<BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 4078
>>      lladdr 04:1e:64:ff:fe:f8:aa:d2 
>>      media: autoselect <full-duplex>
>>      status: inactive
>> <ompi_info.txt>