(only replying to users list)

Some suggestions:

- MPI seems to start up, but the additional TCP connections required for MPI communication seem to be failing, timing out, or hitting some other error.
- Are you running firewalls between your machines? If so, can you disable them?
- I see that you're specifying "--mca btl_tcp_port_min_v4 36900", but one of the debug lines reads:
  [apex-backpack:31956] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360
- Try not using the name "localhost", but rather the IP address of the local machine.
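
One observation on the port numbers, offered as a hypothesis rather than a diagnosis: 9360 is exactly what results when the two bytes of 36900 are swapped (36900 is 0x9024, 9360 is 0x2490). The debug line may therefore be printing the port in network byte order on a little-endian machine, in which case the connection attempt would still fall inside the requested range. The Python sketch below only checks that arithmetic; `swap16` is a helper name made up for this illustration.

```python
def swap16(value: int) -> int:
    """Swap the two bytes of a 16-bit value (the effect of a missing ntohs())."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


requested_min = 36900  # from "--mca btl_tcp_port_min_v4 36900"
logged_port = 9360     # from the "attempting to connect()" debug line

print(hex(requested_min), hex(logged_port))
print(swap16(logged_port) == requested_min)
```

If the comparison prints True, the mismatch between the two numbers is at least consistent with a byte-order display issue and does not by itself show that the port setting is being ignored.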


On Sep 22, 2009, at 5:27 PM, Pallab Datta wrote:

The following is the ifconfig output for the Mac and the Linux box, respectively:

fuji:openmpi-1.3.3 pallabdatta$ ifconfig
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
        inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255
        ether 00:1f:5b:3d:ea:ac
        media: autoselect (100baseTX <full-duplex>) status: active
        supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        ether 00:1f:5b:3d:ea:ad
        media: autoselect status: inactive
        supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 4078
        lladdr 00:22:41:ff:fe:ed:7d:a8
        media: autoselect <full-duplex> status: inactive
        supported media: autoselect <full-duplex>


LINUX:
====
pallabdatta@apex-backpack:~/backpack/src$ ifconfig
lo        Link encap:Local Loopback
         inet addr:127.0.0.1  Mask:255.0.0.0
         inet6 addr: ::1/128 Scope:Host
         UP LOOPBACK RUNNING  MTU:16436  Metric:1
         RX packets:116 errors:0 dropped:0 overruns:0 frame:0
         TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:0
         RX bytes:11788 (11.7 KB)  TX bytes:11788 (11.7 KB)

wlan0     Link encap:Ethernet  HWaddr 00:21:79:c2:54:c7
         inet addr:10.11.14.205  Bcast:10.11.14.255  Mask:255.255.240.0
         inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
         TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:5459312 (5.4 MB)  TX bytes:7264193 (7.2 MB)

wmaster0  Link encap:UNSPEC  HWaddr 00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:0 errors:0 dropped:0 overruns:0 frame:0
         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

The Mac is a Mac Pro with two 2.26GHz quad-core Intel Xeon processors, and the Linux box runs Ubuntu Server Edition 9.04. The Mac uses its Ethernet interface to connect to the network, and the Linux box connects via a wireless adapter (IOGEAR).

Please let me know of any way I can fix this issue. It really needs to work for
our project.
thanks in advance,
regards,
pallab





My other concern was the following but I am not sure it applies here.
If you have multiple interfaces on the node, and they are on the same
subnet, then you cannot actually select what IP address to go out of.
You can only select the IP address you want to connect to. In these
cases, I have seen a hang because we think we are selecting an IP
address to go out of, but it actually goes out the other one.

Perhaps you can send the users list the output of "ifconfig" from each of the
machines, which would show all the interfaces. You will need to use the
right arguments for ifconfig depending on the OS you are running.
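
To check whether two interfaces sit on the same subnet, the address and netmask reported by ifconfig are enough. A minimal Python sketch using the standard `ipaddress` module follows; the helper names `network_of` and `same_subnet` are made up for this illustration, and the sample values are the en0 and wlan0 addresses posted elsewhere in this thread.

```python
import ipaddress


def network_of(addr: str, netmask: str) -> ipaddress.IPv4Network:
    """Return the network that an address/netmask pair belongs to."""
    return ipaddress.ip_network(f"{addr}/{netmask}", strict=False)


def same_subnet(addr_a: str, mask_a: str, addr_b: str, mask_b: str) -> bool:
    """True if both address/netmask pairs describe the same network."""
    return network_of(addr_a, mask_a) == network_of(addr_b, mask_b)


# Mac OS X prints the netmask in hex; convert it to dotted-quad form first.
mac_mask = str(ipaddress.ip_address(int("0xfffff000", 16)))

mac_net = network_of("10.11.14.203", mac_mask)
print(mac_mask, mac_net, mac_net.broadcast_address)
print(same_subnet("10.11.14.203", mac_mask, "10.11.14.205", "255.255.240.0"))
```

The same helper can be applied pairwise to every interface on a single node to spot the case described above, where two interfaces of one machine share a subnet.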

One thought is make sure the ethernet interface is marked down on both
boxes if that is possible.

Pallab Datta wrote:
Any suggestions on how to debug this further?
Do you think I need to enable any other option besides heterogeneous at
the configure prompt?


The --enable-heterogeneous option should do the trick.  And to answer the
previous question: yes, put both of the interfaces in the include list.

--mca btl_tcp_if_include en0,wlan0

If that does not work, then I have one other thought about why it might
not work, although perhaps not a solution.

Rolf

Pallab Datta wrote:

Hi Rolf,

Do I need to configure Open MPI with some specific options apart from
--enable-heterogeneous?
I am currently using
./configure --prefix=/usr/local/ --enable-heterogeneous --disable-static --enable-shared --enable-debug

on both ends. Is the above correct? Please let me know.
thanks and regards,
pallab



Hi:
I assume that if you wait several minutes, then your program will actually
time out, yes? I guess I have two suggestions. First, can you run a
non-MPI job using the wireless? Something like hostname? Secondly, you
may want to specify the specific interfaces you want it to use on the
two machines.  You can do that via the "--mca btl_tcp_if_include"
run-time parameter. Just list the ones that you expect it to use.
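
Beyond running hostname, a plain TCP connect test can show whether a given port is reachable over the wireless link. A minimal Python sketch follows; `can_connect` and `reachable_ports` are made-up helper names, and the block only exercises them against a throwaway listener on the local machine so that it runs anywhere.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reachable_ports(host: str, first_port: int, count: int, timeout: float = 3.0):
    """Return the ports in [first_port, first_port + count) that accept a connection."""
    return [p for p in range(first_port, first_port + count)
            if can_connect(host, p, timeout)]


# Self-check against a temporary listener bound to an OS-chosen local port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
local_port = listener.getsockname()[1]
print(can_connect("127.0.0.1", local_port))
listener.close()
```

For the setup in this thread, one would call something like `reachable_ports("10.11.14.203", 36900, 32)` from the Linux box while the job is hanging, assuming the port range option means 32 ports starting at 36900. A port only shows up as reachable if a process is actually listening on it at that moment.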

Also, this is not right: "--mca OMPI_mca_mpi_preconnect_all 1". It should be
"--mca mpi_preconnect_mpi 1" if you want to do the connection
during MPI_Init.
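
Putting the suggestions in this thread together, a corrected invocation could be assembled as below. This Python sketch only builds and prints the command line; it does not run anything. It uses only flags already quoted in the thread, replaces "localhost" with the Mac's address 10.11.14.203, and `build_mpirun_args` is a made-up helper name. Whether this combination resolves the hang is not established here.

```python
import shlex


def build_mpirun_args(hosts, program, interfaces, nprocs=2):
    """Assemble an mpirun argument list from options quoted in this thread."""
    return [
        "/usr/local/bin/mpirun",
        "--mca", "btl", "tcp,self",
        "--mca", "btl_tcp_if_include", ",".join(interfaces),
        "--mca", "mpi_preconnect_mpi", "1",  # replaces OMPI_mca_mpi_preconnect_all
        "-np", str(nprocs),
        "-hetero",
        "-H", ",".join(hosts),
        program,
    ]


args = build_mpirun_args(
    hosts=["10.11.14.203", "10.11.14.205"],
    program="/tmp/hello",
    interfaces=["en0", "wlan0"],
)
print(shlex.join(args))
```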

Rolf

Pallab Datta wrote:


The following is the error dump

fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca btl tcp,self --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 /tmp/hello
[fuji.local:01316] mca: base: components_open: Looking for btl components
[fuji.local:01316] mca: base: components_open: opening btl components
[fuji.local:01316] mca: base: components_open: found loaded component self
[fuji.local:01316] mca: base: components_open: component self has no register function
[fuji.local:01316] mca: base: components_open: component self open function successful
[fuji.local:01316] mca: base: components_open: found loaded component tcp
[fuji.local:01316] mca: base: components_open: component tcp has no register function
[fuji.local:01316] mca: base: components_open: component tcp open function successful
[fuji.local:01316] select: initializing btl component self
[fuji.local:01316] select: init of component self returned success
[fuji.local:01316] select: initializing btl component tcp
[fuji.local:01316] select: init of component tcp returned success
[apex-backpack:04753] mca: base: components_open: Looking for btl components
[apex-backpack:04753] mca: base: components_open: opening btl components
[apex-backpack:04753] mca: base: components_open: found loaded component self
[apex-backpack:04753] mca: base: components_open: component self has no register function
[apex-backpack:04753] mca: base: components_open: component self open function successful
[apex-backpack:04753] mca: base: components_open: found loaded component tcp
[apex-backpack:04753] mca: base: components_open: component tcp has no register function
[apex-backpack:04753] mca: base: components_open: component tcp open function successful
[apex-backpack:04753] select: initializing btl component self
[apex-backpack:04753] select: init of component self returned success
[apex-backpack:04753] select: initializing btl component tcp
[apex-backpack:04753] select: init of component tcp returned success
Process 0 on fuji.local out of 2
Process 1 on apex-backpack out of 2
[apex-backpack:04753] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360







Hi

I am trying to run Open MPI 1.3.3 between a Linux box running Ubuntu
Server 9.04 and a Macintosh. I have configured Open MPI with the
following options:
./configure --prefix=/usr/local/ --enable-heterogeneous --disable-shared --enable-static

When both machines are connected to the network via Ethernet cables,
Open MPI works fine.

But when I switch the Linux box to a wireless adapter, I can reach (ping)
the Macintosh, but Open MPI hangs on a hello world program.

I ran:

/usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero -H localhost,10.11.14.205 /tmp/back

It hangs on a send/receive function between the two ends. All my firewalls
are turned off at the Macintosh end. Please help ASAP.
regards,
pallab
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users







--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================











_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
jsquy...@cisco.com
