On 2/1/07, Galen Shipman <gship...@lanl.gov> wrote:
What does ifconfig report on both nodes?

Hi Galen,

On headnode:
# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:11:43:EF:5D:6C
         inet addr:10.1.1.11  Bcast:10.1.1.255  Mask:255.255.255.0
         inet6 addr: fe80::211:43ff:feef:5d6c/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:279965 errors:0 dropped:0 overruns:0 frame:0
         TX packets:785652 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:28422663 (27.1 MiB)  TX bytes:999981228 (953.6 MiB)
         Base address:0xecc0 Memory:dfae0000-dfb00000

eth1      Link encap:Ethernet  HWaddr 00:11:43:EF:5D:6D
         inet addr:<public IP>  Bcast:172.25.238.255  Mask:255.255.255.0
         inet6 addr: fe80::211:43ff:feef:5d6d/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:1763252 errors:0 dropped:0 overruns:0 frame:0
         TX packets:133260 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:1726135418 (1.6 GiB)  TX bytes:40990369 (39.0 MiB)
         Base address:0xdcc0 Memory:df8e0000-df900000

ib0       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
         inet addr:20.1.0.11  Bcast:20.1.0.255  Mask:255.255.255.0
         UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
         RX packets:9746 errors:0 dropped:0 overruns:0 frame:0
         TX packets:9746 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:128
         RX bytes:576988 (563.4 KiB)  TX bytes:462432 (451.5 KiB)

ib1       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
         inet addr:30.5.0.11  Bcast:30.5.0.255  Mask:255.255.255.0
         UP BROADCAST MULTICAST  MTU:2044  Metric:1
         RX packets:0 errors:0 dropped:0 overruns:0 frame:0
         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:128
         RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

on COMPUTE node:

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:11:43:D1:C0:80
         inet addr:10.1.1.254  Bcast:10.1.1.255  Mask:255.255.255.0
         inet6 addr: fe80::211:43ff:fed1:c080/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:145725 errors:0 dropped:0 overruns:0 frame:0
         TX packets:85136 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:46506800 (44.3 MiB)  TX bytes:14722190 (14.0 MiB)
         Base address:0xbcc0 Memory:df7e0000-df800000

ib0       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
         inet addr:20.1.0.254  Bcast:20.1.0.255  Mask:255.255.255.0
         UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
         RX packets:9773 errors:0 dropped:0 overruns:0 frame:0
         TX packets:9773 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:128
         RX bytes:424624 (414.6 KiB)  TX bytes:617676 (603.1 KiB)

ib1       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
         inet addr:30.5.0.254  Bcast:30.5.0.255  Mask:255.255.255.0
         UP BROADCAST MULTICAST  MTU:2044  Metric:1
         RX packets:0 errors:0 dropped:0 overruns:0 frame:0
         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:128
         RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)


Additionally, I've discovered that this problem appears to be specific
to either the Dell hardware or Gig-E, since I cannot reproduce it in my
VMware cluster. Here is the lspci output for the Ethernet devices:
[headnode]# lspci |grep -i "ether"; ssh -x compute-0-0 'lspci |grep -i ether'
06:07.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)
07:08.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)
07:07.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)

i.e. the headnode has two Gig-E interfaces and the compute node has
one, and all three are the same controller.
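
Since both boxes have several interfaces up (two Gig-E ports on the
headnode plus the IPoIB ones), my next step is to pin the TCP BTL to
the private eth0 subnet and see whether the crash goes away. Assuming
I understand the btl_tcp_if_include/btl_tcp_if_exclude parameters
correctly, something like this should do it (not tried yet):

# mpirun -hostfile ~/testdir/hosts --mca btl tcp,self \
    --mca btl_tcp_if_include eth0 ~/testdir/hello

or, going the other way, excluding everything except eth0:

# mpirun -hostfile ~/testdir/hosts --mca btl tcp,self \
    --mca btl_tcp_if_exclude lo,eth1,ib0,ib1 ~/testdir/hello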

Thanks,
Alex.
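
P.S. For completeness, the ~/testdir/hello program is nothing more than
the usual MPI hello world, compiled with the mpicc wrapper from
/opt/openmpi/1.1.4/bin. Reconstructed from memory (so the exact wording
may differ slightly), it looks roughly like this:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
    MPI_Get_processor_name(name, &len);    /* hostname for the greeting */

    printf("Hello from Alex' MPI test program\n");
    printf("Process %d on %s out of %d\n", rank, name, size);

    MPI_Finalize();  /* the backtrace below points here (ompi_mpi_finalize) */
    return 0;
}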

On 2/1/07, Galen Shipman <gship...@lanl.gov> wrote:
What does ifconfig report on both nodes?

- Galen

On Feb 1, 2007, at 2:50 PM, Alex Tumanov wrote:

> Hi,
>
> I have continued investigating on my own and recompiled Open MPI with
> only bare-bones functionality, i.e. with no support for any
> interconnects other than Ethernet:
> # rpmbuild --rebuild --define="configure_options
> --prefix=/opt/openmpi/1.1.4" --define="install_in_opt 1"
> --define="mflags all" openmpi-1.1.4-1.src.rpm
>
> The error detailed in my previous message persisted, which eliminates
> the possibility of interconnect support interfering with ethernet
> support. Here's an excerpt from ompi_info:
> # ompi_info
>                 Open MPI: 1.1.4
>    Open MPI SVN revision: r13362
>                 Open RTE: 1.1.4
>    Open RTE SVN revision: r13362
>                     OPAL: 1.1.4
>        OPAL SVN revision: r13362
>                   Prefix: /opt/openmpi/1.1.4
>  Configured architecture: x86_64-redhat-linux-gnu
>              . . .
>           Thread support: posix (mpi: no, progress: no)
>   Internal debug support: no
>      MPI parameter check: runtime
>              . . .
>                 MCA btl: self (MCA v1.0, API v1.0, Component v1.1.4)
>                  MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.4)
>                  MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
>
> Again, to replicate the error, I ran
> # mpirun -hostfile ~/testdir/hosts  --mca btl tcp,self ~/testdir/hello
> In this case, you can even omit the runtime mca param specifications:
> # mpirun -hostfile ~/testdir/hosts  ~/testdir/hello
>
> Thanks for reading this. I hope I've provided enough information.
>
> Sincerely,
> Alex.
>
> On 2/1/07, Alex Tumanov <atuma...@gmail.com> wrote:
>> Hello,
>>
>> I have tried a very basic test on a two-node "cluster" consisting of
>> two Dell boxes. The headnode is a dual-CPU Intel(R) Xeon(TM) 2.80GHz
>> machine with 1GB of RAM, and the slave node is a quad-CPU Intel(R)
>> Xeon(TM) 3.40GHz machine with 2GB of RAM. Both have InfiniBand cards
>> and Gig-E. The slave node is connected directly to the headnode.
>>
>> Open MPI version 1.1.4 was compiled with support for the following
>> BTLs: openib, mx, gm, and mvapi. I got it to work over openib but,
>> ironically, the same trivial hello-world job fails over tcp (please
>> see the log below). I found that the same problem was already
>> discussed on this list:
>> http://www.open-mpi.org/community/lists/users/2006/06/1347.php
>> That discussion suggested there could be something wrong with the
>> nodes' TCP setup, but unfortunately it was taken offline. Could
>> someone help me with this?
>>
>> Thanks,
>> Alex.
>>
>> # mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello
>> Hello from Alex' MPI test program
>> Process 0 on headnode out of 2
>> Hello from Alex' MPI test program
>> Process 1 on compute-0-0.local out of 2
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0xdebdf8
>> [0] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a9587e0e5]
>> [1] func:/lib64/tls/libpthread.so.0 [0x3d1a00c430]
>> [2] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a95880729]
>> [3] func:/opt/openmpi/1.1.4/lib/libopal.so.0(_int_free+0x24a) [0x2a95880d7a]
>> [4] func:/opt/openmpi/1.1.4/lib/libopal.so.0(free+0xbf) [0x2a9588303f]
>> [5] func:/opt/openmpi/1.1.4/lib/libmpi.so.0 [0x2a955949ca]
>> [6] func:/opt/openmpi/1.1.4/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_component_close+0x34f) [0x2a988ee8ef]
>> [7] func:/opt/openmpi/1.1.4/lib/libopal.so.0(mca_base_components_close+0xde) [0x2a95872e1e]
>> [8] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_btl_base_close+0xe9) [0x2a955e5159]
>> [9] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_bml_base_close+0x9) [0x2a955e5029]
>> [10] func:/opt/openmpi/1.1.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_close+0x25) [0x2a97f4dc55]
>> [11] func:/opt/openmpi/1.1.4/lib/libopal.so.0(mca_base_components_close+0xde) [0x2a95872e1e]
>> [12] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_pml_base_close+0x69) [0x2a955ea3e9]
>> [13] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(ompi_mpi_finalize+0xfe) [0x2a955ab57e]
>> [14] func:/root/testdir/hello(main+0x7b) [0x4009d3]
>> [15] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3d1951c3fb]
>> [16] func:/root/testdir/hello [0x4008ca]
>> *** End of error message ***
>> mpirun noticed that job rank 0 with PID 15573 on node "dr11.local"
>> exited on signal 11.
>> 2 additional processes aborted (not shown)
>>
