Have you tried running without limiting the port range?

On Sep 24, 2009, at 12:39 PM, Pallab Datta wrote:

Hi All,

Yes, I can ping and ssh from apex-backpack to my Mac (fuji.local).
I fixed the wireless broadcast address so that it is the same on both ends
(10.11.14.255), but the problem still persists.

I have tried other wireless adapters as well, but no luck so far.
Please let me know what can be done.
regards, pallab

(putting this back on the list where others can reply as well, and if
we solve it, the solution will be google-ized)

According to your debug output:

[apex-backpack:31956] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360

It *is* trying to connect to the right IP address.  Are you able to
ping to .203 from apex-backpack?
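
Ping only exercises ICMP, so a check that is closer to what the TCP BTL does is a plain TCP connect to the address and port from the debug line. The following is a minimal sketch, not an Open MPI tool; the helper name `probe_tcp` and the 3-second timeout are assumptions made for illustration.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connection and report how the attempt ended."""
    try:
        # create_connection resolves the host and performs connect()
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The peer answered but nothing is listening on that port
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. no route to host
        return f"error: {exc}"
```

Run from apex-backpack as `probe_tcp("10.11.14.203", 9360)` while the MPI job is hanging. A "refused" result still shows that packets reach the Mac, whereas "timeout" would point towards something dropping the traffic on the way.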

I also notice that your ethernet configuration does not exactly match
between linux and osx:

en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255

wlan0     Link encap:Ethernet  HWaddr 00:21:79:c2:54:c7
          inet addr:10.11.14.205  Bcast:10.11.14.255  Mask:255.255.240.0
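
To see exactly where the two configurations differ, the network and broadcast address can be derived from each address and netmask. This is a small sketch using Python's standard `ipaddress` module; the helper name `describe` is an assumption for illustration. It handles both the hex mask printed by OS X and the dotted mask printed by Linux.

```python
import ipaddress


def describe(addr: str, mask: str) -> dict:
    """Derive network, prefix length and broadcast from an address and netmask.

    `mask` may be dotted-quad (Linux ifconfig) or hex such as
    0xfffff000 (OS X ifconfig).
    """
    if mask.lower().startswith("0x"):
        # Convert the hex form to dotted-quad first
        mask = str(ipaddress.IPv4Address(int(mask, 16)))
    net = ipaddress.IPv4Interface(f"{addr}/{mask}").network
    return {
        "network": str(net.network_address),
        "prefixlen": net.prefixlen,
        "broadcast": str(net.broadcast_address),
    }


mac = describe("10.11.14.203", "0xfffff000")      # en0 on the Mac
linux = describe("10.11.14.205", "255.255.240.0")  # wlan0 on the Linux box
print(mac)
print(linux)
```

Both masks describe the same /20 network, and the derived broadcast address is 10.11.15.255 for both hosts. The value that stands out is therefore the 10.11.14.255 broadcast reported on wlan0. Whether that has anything to do with the hang is not established here.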


On Sep 22, 2009, at 9:26 PM, Pallab Datta wrote:

There is no firewall running between the machines. I tried using the IP
address instead of localhost, but it gave me the same output. MPI is not
even timing out; it just hangs indefinitely. :(

I have disabled the ethernet interface on the Linux box, keeping only the
wireless up. On the Mac I only have the ethernet turned on. My Mac is an
8-core Mac Pro.

Please help me debug this..
thanks in advance, regards,
pallab


(only replying to users list)

Some suggestions:

- MPI seems to start up, but the additional TCP connections required for
  MPI connections seem to be failing / timing out / hitting some other error.
- Are you running firewalls between your machines?  If so, can you
  disable them?
- I see that you're specifying "--mca btl_tcp_port_min_v4 36900", but
  one of the debug lines reads:
  [apex-backpack:31956] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360
- Try not using the name "localhost", but rather the IP address of the
  local machine.
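
One arithmetic observation about the two port numbers, offered as a possibility rather than a diagnosis: 36900 and 9360 are the same 16-bit value with its two bytes exchanged. That would be consistent with a port number being printed, or used, without a host/network byte-order conversion somewhere, but nothing in this thread confirms which. The sketch below only demonstrates the arithmetic; `swap16` is a name made up for illustration.

```python
import struct


def swap16(value: int) -> int:
    """Return a 16-bit value with its two bytes exchanged."""
    # Pack as big-endian, then reinterpret the same bytes as little-endian
    return struct.unpack("<H", struct.pack(">H", value))[0]


requested = 36900  # value passed to btl_tcp_port_min_v4
reported = 9360    # port shown in the debug line

print(hex(requested), hex(reported))
print(swap16(requested) == reported)
```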


On Sep 22, 2009, at 5:27 PM, Pallab Datta wrote:

The following are the ifconfig for both the Mac and the Linux
respectively:

fuji:openmpi-1.3.3 pallabdatta$ ifconfig
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
        inet 10.11.14.203 netmask 0xfffff000 broadcast 10.11.15.255
        ether 00:1f:5b:3d:ea:ac
        media: autoselect (100baseTX <full-duplex>) status: active
        supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        ether 00:1f:5b:3d:ea:ad
        media: autoselect status: inactive
        supported media: autoselect 10baseT/UTP <half-duplex> 10baseT/UTP <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 10baseT/UTP <full-duplex,flow-control> 100baseTX <half-duplex> 100baseTX <full-duplex> 100baseTX <full-duplex,hw-loopback> 100baseTX <full-duplex,flow-control> 1000baseT <full-duplex> 1000baseT <full-duplex,hw-loopback> 1000baseT <full-duplex,flow-control>
fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 4078
        lladdr 00:22:41:ff:fe:ed:7d:a8
        media: autoselect <full-duplex> status: inactive
        supported media: autoselect <full-duplex>


LINUX:
====
pallabdatta@apex-backpack:~/backpack/src$ ifconfig
lo        Link encap:Local Loopback
       inet addr:127.0.0.1  Mask:255.0.0.0
       inet6 addr: ::1/128 Scope:Host
       UP LOOPBACK RUNNING  MTU:16436  Metric:1
       RX packets:116 errors:0 dropped:0 overruns:0 frame:0
       TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:0
       RX bytes:11788 (11.7 KB)  TX bytes:11788 (11.7 KB)

wlan0     Link encap:Ethernet  HWaddr 00:21:79:c2:54:c7
       inet addr:10.11.14.205  Bcast:10.11.14.255  Mask:255.255.240.0
       inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
       RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
       TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:5459312 (5.4 MB)  TX bytes:7264193 (7.2 MB)

wmaster0  Link encap:UNSPEC  HWaddr 00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
       RX packets:0 errors:0 dropped:0 overruns:0 frame:0
       TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:1000
       RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

The Mac is a Mac Pro with two 2.26GHz quad-core Intel Xeon processors, and
the Linux box runs Ubuntu Server Edition 9.04. The Mac uses its ethernet
interface to connect to the network, and the Linux box connects via a
wireless adapter (IOGEAR).

Please help me find a way to fix this issue. It really needs to work for
our project.
thanks in advance,
regards,
pallab





My other concern was the following, but I am not sure it applies here.
If you have multiple interfaces on the node, and they are on the same
subnet, then you cannot actually select what IP address to go out of.
You can only select the IP address you want to connect to. In these
cases, I have seen a hang because we think we are selecting an IP
address to go out of, but it actually goes out the other one.
Perhaps you can send the users list the output from "ifconfig" on each
of the machines, which would show all the interfaces. You need to get the
right arguments for ifconfig depending on the OS you are running on.

One thought is to make sure the ethernet interface is marked down on both
boxes, if that is possible.
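
The "multiple interfaces on the same subnet" condition described above can be checked mechanically once the ifconfig output is in hand. The following is a sketch only; the function name `shared_subnets` and the `eth0` address in the example are hypothetical and not taken from either machine.

```python
import ipaddress
from collections import defaultdict


def shared_subnets(interfaces: dict) -> dict:
    """Return the networks that have two or more interfaces on them."""
    by_net = defaultdict(list)
    for name, cidr in interfaces.items():
        # ip_interface accepts "address/prefix" and exposes the network
        by_net[str(ipaddress.ip_interface(cidr).network)].append(name)
    return {net: names for net, names in by_net.items() if len(names) > 1}


# Hypothetical node with wired and wireless both up on the same /20.
# Only the wlan0 address comes from the ifconfig output in this thread.
example = {
    "lo": "127.0.0.1/8",
    "eth0": "10.11.14.204/20",
    "wlan0": "10.11.14.205/20",
}
print(shared_subnets(example))
```

An empty result means no two interfaces share a subnet, so the outgoing-address ambiguity described above would not apply to that node.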

Pallab Datta wrote:
Any suggestions on how to debug this further?
Do you think I need to enable any other option besides heterogeneous at
the configure prompt?


The --enable-heterogeneous option should do the trick. And to answer the
previous question, yes, put both of the interfaces in the include list.

--mca btl_tcp_if_include en0,wlan0

If that does not work, then I may have one other thought on why it might
not work, although perhaps not a solution.

Rolf

Pallab Datta wrote:

Hi Rolf,

Do I need to configure openmpi with some specific options apart from
--enable-heterogeneous?
I am currently using

./configure --prefix=/usr/local/ --enable-heterogeneous --disable-static --enable-shared --enable-debug

on both ends. Is the above correct? Please let me know.
thanks and regards,
pallab



Hi:
I assume if you wait several minutes then your program will actually
time out, yes? I guess I have two suggestions. First, can you run a
non-MPI job using the wireless?  Something like hostname? Secondly, you
may want to specify the specific interfaces you want it to use on the
two machines.  You can do that via the "--mca btl_tcp_if_include"
run-time parameter.  Just list the ones that you expect it to use.

Also, this is not right: "--mca OMPI_mca_mpi_preconnect_all 1".  It
should be "--mca mpi_preconnect_mpi 1" if you want to do the connection
during MPI_Init.

Rolf
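
Putting the suggestions above together, the command line could be assembled as in the sketch below. It uses only parameters that appear in this thread (btl, btl_tcp_if_include, mpi_preconnect_mpi, -np, -H). The helper `mpirun_command` is hypothetical, and the choice to list both machines by IP address rather than "localhost" is an assumption based on the advice given elsewhere in the thread.

```python
import shlex


def mpirun_command(hosts, program, interfaces, np=2, preconnect=True):
    """Build an mpirun argument list from the parameters discussed above."""
    args = ["mpirun", "--mca", "btl", "tcp,self"]
    # Restrict the TCP BTL to the interfaces we expect it to use
    args += ["--mca", "btl_tcp_if_include", ",".join(interfaces)]
    if preconnect:
        # Corrected parameter name for connecting during MPI_Init
        args += ["--mca", "mpi_preconnect_mpi", "1"]
    args += ["-np", str(np), "-H", ",".join(hosts), program]
    return args


cmd = mpirun_command(
    hosts=["10.11.14.203", "10.11.14.205"],
    program="/tmp/hello",
    interfaces=["en0", "wlan0"],
)
print(shlex.join(cmd))
```

Building the list first and joining it with `shlex.join` keeps each parameter and its value together, which makes a misspelled parameter name easier to spot than in one long hand-typed line.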

Pallab Datta wrote:


The following is the error dump

fuji:src pallabdatta$ /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 \
    -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 \
    --mca btl tcp,self --mca OMPI_mca_mpi_preconnect_all 1 \
    -np 2 -hetero -H localhost,10.11.14.205 /tmp/hello
[fuji.local:01316] mca: base: components_open: Looking for btl components
[fuji.local:01316] mca: base: components_open: opening btl components
[fuji.local:01316] mca: base: components_open: found loaded component self
[fuji.local:01316] mca: base: components_open: component self has no register function
[fuji.local:01316] mca: base: components_open: component self open function successful
[fuji.local:01316] mca: base: components_open: found loaded component tcp
[fuji.local:01316] mca: base: components_open: component tcp has no register function
[fuji.local:01316] mca: base: components_open: component tcp open function successful
[fuji.local:01316] select: initializing btl component self
[fuji.local:01316] select: init of component self returned success
[fuji.local:01316] select: initializing btl component tcp
[fuji.local:01316] select: init of component tcp returned success
[apex-backpack:04753] mca: base: components_open: Looking for btl components
[apex-backpack:04753] mca: base: components_open: opening btl components
[apex-backpack:04753] mca: base: components_open: found loaded component self
[apex-backpack:04753] mca: base: components_open: component self has no register function
[apex-backpack:04753] mca: base: components_open: component self open function successful
[apex-backpack:04753] mca: base: components_open: found loaded component tcp
[apex-backpack:04753] mca: base: components_open: component tcp has no register function
[apex-backpack:04753] mca: base: components_open: component tcp open function successful
[apex-backpack:04753] select: initializing btl component self
[apex-backpack:04753] select: init of component self returned success
[apex-backpack:04753] select: initializing btl component tcp
[apex-backpack:04753] select: init of component tcp returned success
Process 0 on fuji.local out of 2
Process 1 on apex-backpack out of 2
[apex-backpack:04753] btl: tcp: attempting to connect() to address 10.11.14.203 on port 9360







Hi

I am trying to run open-mpi 1.3.3 between a Linux box running Ubuntu
Server v.9.04 and a Macintosh. I have configured openmpi with the
following options:
./configure --prefix=/usr/local/ --enable-heterogeneous --disable-shared --enable-static

When both machines are connected to the network via ethernet cables,
openmpi works fine.

But when I switch the Linux box to a wireless adapter, I can reach (ping)
the Macintosh, but openmpi hangs on a hello world program.

I ran:

/usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 \
    -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 \
    --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero \
    -H localhost,10.11.14.205 /tmp/back

It hangs on a send/receive function between the two ends. All my
firewalls are turned off at the Macintosh end. PLEASE HELP ASAP.
regards,
pallab
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users





--

=========================
rolf.vandeva...@sun.com
781-442-3043
=========================



_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
jsquy...@cisco.com
