Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Jeff Squyres
(putting this back on the list where others can reply as well, and if  
we solve it, the solution will be google-ized)


According to your debug output:


[apex-backpack:31956] btl: tcp: attempting to connect() to address
10.11.14.203 on port 9360


It *is* trying to connect to the right IP address.  Are you able to  
ping to .203 from apex-backpack?
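
Ping only exercises ICMP, so a successful ping does not show that a TCP connection can complete. A minimal sketch of a TCP-level check, assuming Python is available on apex-backpack; the function name and the timeout are illustrative and not part of Open MPI:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) completes within timeout."""
    try:
        # create_connection resolves the host, connects, and honours the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Calling can_connect("10.11.14.203", port) from apex-backpack, with the port the Mac-side process is actually listening on, would separate "the network drops the TCP connection" from "the connection succeeds and the hang is inside MPI".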


I also notice that your ethernet configuration does not exactly match
between linux and osx:


en0: flags=8863 mtu 1500
    inet 10.11.14.203 netmask 0xf000 broadcast 10.11.15.255

wlan0 Link encap:Ethernet  HWaddr 00:21:79:c2:54:c7
    inet addr:10.11.14.205  Bcast:10.11.14.255  Mask:255.255.240.0
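
The mismatch is in the broadcast addresses. As a worked check, the broadcast address implied by an address and netmask can be computed with Python's standard ipaddress module. This is a sketch that assumes both hosts are meant to use the 255.255.240.0 mask shown for wlan0 (the netmask printed for en0 looks truncated, so it is not used here); the helper name is made up for illustration:

```python
import ipaddress


def expected_broadcast(addr, netmask):
    """Return the broadcast address implied by an IPv4 address and netmask."""
    iface = ipaddress.ip_interface(f"{addr}/{netmask}")
    return str(iface.network.broadcast_address)


# wlan0 values from the ifconfig output above.
print(expected_broadcast("10.11.14.205", "255.255.240.0"))  # 10.11.15.255
```

Under that assumption the 10.11.15.255 reported on en0 is the value consistent with the mask, and the 10.11.14.255 on wlan0 is the one that does not match.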



On Sep 22, 2009, at 9:26 PM, Pallab Datta wrote:

There is no firewall running between the machines. I tried using the IP
address instead of localhost but it gave me the same output. MPI is not
even timing out..it keeps eternally hanging on..:(

I have disabled the ethernet interface on the linux box, keeping only the
wireless up. On the mac i only have the ethernet turned on. My mac is a 8
core mac pro.

Please help me debug this..
thanks in advance, regards,
pallab



(only replying to users list)

Some suggestions:

- MPI seems to start up, but the additional TCP connections required for
  MPI connections seem to be failing / timing out / some other error.
- Are you running firewalls between your machines?  If so, can you
  disable them?
- I see that you're specifying "--mca btl_tcp_port_min_v4 36900" but
  one of the debug lines reads:

  [apex-backpack:31956] btl: tcp: attempting to connect() to address
  10.11.14.203 on port 9360

- Try not using the name "localhost", but rather the IP address of the
  local machine


On Sep 22, 2009, at 5:27 PM, Pallab Datta wrote:


The following are the ifconfig for both the Mac and the Linux
respectively:

fuji:openmpi-1.3.3 pallabdatta$ ifconfig
lo0: flags=8049 mtu 16384
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
inet 127.0.0.1 netmask 0xff00
inet6 ::1 prefixlen 128
gif0: flags=8010 mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863 mtu 1500
inet6 fe80::21f:5bff:fe3d:eaac%en0 prefixlen 64 scopeid 0x4
inet 10.11.14.203 netmask 0xf000 broadcast 10.11.15.255
ether 00:1f:5b:3d:ea:ac
media: autoselect (100baseTX ) status: active
supported media: autoselect 10baseT/UTP  10baseT/UTP
 10baseT/UTP  10baseT/UTP
 100baseTX  100baseTX
 100baseTX  100baseTX
 1000baseT  1000baseT
 1000baseT 
en1: flags=8863 mtu 1500
ether 00:1f:5b:3d:ea:ad
media: autoselect status: inactive
supported media: autoselect 10baseT/UTP  10baseT/UTP
 10baseT/UTP  10baseT/UTP
 100baseTX  100baseTX
 100baseTX  100baseTX
 1000baseT  1000baseT
 1000baseT 
fw0: flags=8863 mtu 4078
lladdr 00:22:41:ff:fe:ed:7d:a8
media: autoselect  status: inactive
supported media: autoselect 


LINUX:

pallabdatta@apex-backpack:~/backpack/src$ ifconfig
lo        Link encap:Local Loopback
inet addr:127.0.0.1  Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING  MTU:16436  Metric:1
RX packets:116 errors:0 dropped:0 overruns:0 frame:0
TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11788 (11.7 KB)  TX bytes:11788 (11.7 KB)

wlan0 Link encap:Ethernet  HWaddr 00:21:79:c2:54:c7
inet addr:10.11.14.205  Bcast:10.11.14.255  Mask:255.255.240.0
inet6 addr: fe80::221:79ff:fec2:54c7/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:72531 errors:0 dropped:0 overruns:0 frame:0
TX packets:28894 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5459312 (5.4 MB)  TX bytes:7264193 (7.2 MB)

wmaster0  Link encap:UNSPEC  HWaddr 00-21-79-C2-54-C7-34-63-00-00-00-00-00-00-00-00
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

The Mac is a Mac Pro with two 2.26GHz Quad-Core Intel Xeon processors and
the Linux box runs Ubuntu Server Edition 9.04. The Mac uses its ethernet
interface to connect to the network and the Linux box connects via a
wireless adapter (IOGEAR).

Please help me with any way I can fix this issue. It really needs to work
for our project.
thanks in advance,
regards,
pallab





My other concern was the following but I am not sure it applies here.
If you have multiple interfaces on the node, and they are on the same
subnet, then you cannot actually select what IP address to go out of.
You can only select the IP address you want to connect to. In these
cases, I have seen a hang because we think we are selecting an IP
address to go out of, but it actually goes out the other one.
Perhaps you can send the User's list the 

Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Pallab Datta
Hi All,

Yes I can ping and ssh from apex-backpack to my Mac (fuji.local).
I fixed the wireless broadcast to reflect the same on both ends
(10.11.14.255) but still the problem persists.

I have tried other wireless adapters as well, but no luck so far.
Please let me know what can be done...
regards, pallab


Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Jeff Squyres

Have you tried running without limiting the port range?


Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Pallab Datta
Yes, I had tried that. Initially it (apex-backpack) was trying to connect to
the Mac (10.11.14.203) at port number 4, which is too low. That's why I
made the port range higher.



Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Jeff Squyres

On Sep 24, 2009, at 12:54 PM, Pallab Datta wrote:

> Yes I had tried that initially it (apex-backpack) was trying to connect
> the Mac (10.11.14.203) at port number 4 which is too low. So that's why I
> made the port range higher..


Port 4?  OMPI should never connect at port 4; it's privileged.  Was  
that in the debug output?
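
One reading that would reconcile the odd port numbers, offered as an assumption that nothing in this thread confirms: the debug line may be printing the port in network byte order without converting it back. Swapping the two bytes of the requested minimum port, 36900, gives exactly the 9360 seen in the debug output, and swapping 1024 gives 4. A small sketch of the arithmetic (the helper name is made up):

```python
def swap16(port):
    """Swap the two bytes of a 16-bit value, as a missing ntohs() would show it on a little-endian host."""
    return ((port & 0xFF) << 8) | ((port >> 8) & 0xFF)


print(swap16(36900))  # 9360, the port in the debug line
print(swap16(1024))   # 4, the "port 4" reported earlier
```

If that reading is right, the connections were being attempted on 36900 and 1024 respectively, and the printed values are only a display artifact.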


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Pallab Datta
Yes, it came up in the debug output when I turned on verbose mode.
I knew it's privileged, which is why I explicitly asked it to connect to
a higher port, but it still blocks there. :(




Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Jeff Squyres
Ok, I think we're outta options here :-( -- our debugging output is
not sufficient to tell us what is going wrong.  If I make a Mercurial
repo with some extra debugging output in it, can you check it out and
build it?  That way we can run it and add relevant printf's in the
right places to see exactly what is failing.

Here are the requirements for you to be able to build a developer's
copy of Open MPI (if you can do this, I'll send you a specific
Mercurial repository URL):


http://www.open-mpi.org/svn/building.php



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Joshua Bernstein

Hmm,

On another angle, could this be a name resolution issue? Perhaps apex-backpack
isn't able to resolve fuji.local and vice versa. Can you ping between the two of
them using their hostnames rather than their IPs?
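
A minimal sketch of the same check done through the system resolver rather than ping, assuming Python is installed on both machines; the function name is illustrative:

```python
import socket


def resolve(hostname):
    """Return the sorted addresses the local resolver reports for hostname, or [] if it fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

Running resolve("fuji.local") on apex-backpack and resolve("apex-backpack") on the Mac should each return the other machine's 10.11.14.x address; an empty list or a 127.x address would point at /etc/hosts or DNS.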


-Joshua Bernstein
Senior Software Engineer
Penguin Computing



Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Pallab Datta
I can do it. Please send me the URL; I can rebuild OMPI and see what the
output looks like.

>



Re: [OMPI users] Random hangs using btl sm with OpenMPI 1.3.2/1.3.3 + gcc4.4?

2009-09-24 Thread Eugene Loh

Jonathan Dursi wrote:


So to summarize:

OpenMPI 1.3.2 + gcc4.4.0

Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s:

  Default always hangs in Sendrecv after random number of iterations
  Turning off sm (-mca btl self,tcp) not observed to hang
  Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
  Using fewer than 5 fifos hangs in Sendrecv after random number of
  iterations or Finalize

OpenMPI 1.3.3 + gcc4.4.0

Test problem with periodic (left neighbour of proc 0 is proc N-1) Sendrecv()s:

  Default sometimes (~20% of time) hangs in Sendrecv after random
  number of iterations
  Turning off sm (-mca btl self,tcp) not observed to hang
  Using -mca btl_sm_num_fifos 5 (for a 6 task job) not observed to hang
  Using fewer than 5 fifos but more than 2 not observed to hang
  Using 2 fifos sometimes (~20% of time) hangs in Finalize or
  Sendrecv after random number of iterations but sometimes completes

OpenMPI 1.3.2 + intel 11.0 compilers

We are seeing a problem which we believe to be related; ~1% of certain
single-node jobs hang, turning off sm or setting num_fifos to NP-1
eliminates this.


I can reproduce this with just Barriers, which keeps the processes all 
in sync.  So, this has nothing to do with processes outrunning one 
another (which wasn't likely in the first place given that you had 
Sendrecv calls).


The problem is fickle.  E.g., building OMPI with -g seems to make the 
problem go away.


I did observe that the sm FIFO would fill up.  That's weird since there 
aren't ever a lot of in-flight messages.  I tried adding a line of code 
that would make a process pause if ever it tried to write to a FIFO that 
seemed full.  That pretty much made the problem go away.  So, I guess 
it's a memory coherency problem:  receive clears the FIFO, but writer 
thinks it's congested.


I tried all sorts of GCC compilers.  The problem seems to set in with 
4.4.0.  I don't know what's significant about that.  It requires moving 
to the 2.18 assembler, but I tried the 2.18 assembler with 4.3.3 and 
that worked okay.  I'd think this has to do with GCC 4.4.x, but you say 
you see the problem with Intel compilers as well.  Hmm.  Maybe an OMPI 
problem that's better exposed with GCC 4.4.x?


Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Pallab Datta
I did try to ping using the hostname, but I can't. Can that be an issue?
Both machines are sitting on the same subnet! Let me check if I can
resolve this.

>



[OMPI users] best way to ALLREDUCE multi-dimensional arrays in Fortran?

2009-09-24 Thread Greg Fischer
(I apologize in advance for the simplistic/newbie question.)

I'm performing an ALLREDUCE operation on a multi-dimensional array.  This
operation is the biggest bottleneck in the code, and I'm wondering if
there's a way to do it more efficiently than what I'm doing now.  Here's a
representative example of what's happening:

   ir=1
   do ikl=1,km
     do ij=1,jm
       do ii=1,im
         albuf(ir)=array(ii,ij,ikl,nl,0,ng)
         ir=ir+1
       enddo
     enddo
   enddo
   agbuf=0.0
   call mpi_allreduce(albuf,agbuf,im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
   ir=1
   do ikl=1,km
     do ij=1,jm
       do ii=1,im
         phim(ii,ij,ikl,nl,0,ng)=agbuf(ir)
         ir=ir+1
       enddo
     enddo
   enddo

Is there any way to just do this in one fell swoop, rather than buffering,
transmitting, and unbuffering?  This operation is looped over many times.
Are there savings to be had here?

Thanks,
Greg
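
On the "one fell swoop" question, one thing worth checking is whether the packed elements are already contiguous in memory. Fortran stores arrays in column-major order, so if the loops cover the full declared extents of the first three dimensions, the slab array(:,:,:,nl,0,ng) is one contiguous run and the packing loop copies it in memory order. In that case the first element of each slab could probably be passed to mpi_allreduce directly as the send and receive buffers. The sketch below only demonstrates the index arithmetic; the array bounds are hypothetical, since the real declarations are not shown in the post:

```python
def column_major_offset(index, lower, extent):
    """Linear offset of an element in a Fortran (column-major) array."""
    offset, stride = 0, 1
    for i, lo, n in zip(index, lower, extent):
        offset += (i - lo) * stride
        stride *= n
    return offset


# Hypothetical bounds: array(1:im, 1:jm, 1:km, 1:2, 0:2, 1:2)
im, jm, km = 4, 3, 2
lower = (1, 1, 1, 1, 0, 1)
extent = (im, jm, km, 2, 3, 2)
nl, ng = 2, 1

# Visit the elements in the same order as the packing loop in the post.
offsets = [
    column_major_offset((ii, ij, ikl, nl, 0, ng), lower, extent)
    for ikl in range(1, km + 1)
    for ij in range(1, jm + 1)
    for ii in range(1, im + 1)
]

is_contiguous = offsets == list(range(offsets[0], offsets[0] + im * jm * km))
print(is_contiguous)  # True
```

If the loops cover only part of the first or second dimension, the offsets are no longer consecutive and a buffer (or an MPI derived datatype) is still needed.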


Re: [OMPI users] best way to ALLREDUCE multi-dimensional arrays in Fortran?

2009-09-24 Thread Eugene Loh





There are three steps here:  buffering, transmitting, and unbuffering. 
Any idea how the run time is distributed among those three steps? 
E.g., if most time is spent in the MPI call, then combining all three
steps into one is unlikely to buy you much... and might even hurt.  If
most of the time is spent in the MPI call, then there may be some
tuning of collective algorithms to do.  I don't have any experience
doing this with OMPI.  I'm just saying it makes some sense to isolate
the problem a little bit more.
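
A minimal sketch of that isolation step, with each of the three phases timed separately. It is written in Python with stand-in functions so it runs without MPI; in the Fortran code the same pattern would use calls to MPI_Wtime around each phase. All names here are illustrative:

```python
import time


def time_phases(pack, reduce_fn, unpack, repeats=100):
    """Accumulate wall-clock seconds spent in each of the three phases."""
    totals = {"pack": 0.0, "reduce": 0.0, "unpack": 0.0}
    for _ in range(repeats):
        t0 = time.perf_counter()
        buf = pack()
        t1 = time.perf_counter()
        result = reduce_fn(buf)
        t2 = time.perf_counter()
        unpack(result)
        t3 = time.perf_counter()
        totals["pack"] += t1 - t0
        totals["reduce"] += t2 - t1
        totals["unpack"] += t3 - t2
    return totals


# Stand-ins for the real work; fake_allreduce pretends two ranks held identical data.
data = list(range(10000))
out = [0] * len(data)


def pack():
    return list(data)


def fake_allreduce(buf):
    return [2 * x for x in buf]


def unpack(result):
    out[:] = result


totals = time_phases(pack, fake_allreduce, unpack)
for phase, seconds in totals.items():
    print(f"{phase}: {seconds:.6f} s")
```

If the "reduce" share dominates, removing the copy loops cannot save much; if the copies dominate, avoiding the buffers is worth pursuing.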




Re: [OMPI users] [OMPI devel] Open-MPI between Mac and Linux (ubuntu 9.04) over wireless

2009-09-24 Thread Pallab Datta
I resolved the name resolution issue and re-ran it but it still hangs at
the send-receive calls.
I ran it using:

 /usr/local/bin/mpirun --mca btl_tcp_port_min_v4 36900 \
     -mca btl_tcp_port_range_v4 32 --mca btl_base_verbose 30 \
     --mca OMPI_mca_mpi_preconnect_all 1 -np 2 -hetero \
     -H localhost,10.11.14.205 /tmp/hello
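
One detail in that command line may be worth a second look. Parameters given to --mca are normally written by their bare name, while the OMPI_MCA_ prefix is the environment-variable form, so "--mca OMPI_mca_mpi_preconnect_all 1" is probably not recognized. The sketch below builds the argument list with the bare name; it assumes the intended parameter is mpi_preconnect_all, uses IP addresses for both hosts as suggested earlier in the thread, and leaves out -hetero for brevity. The helper function is made up for illustration:

```python
import shlex


def mpirun_argv(mca_params, hosts, program, np=2):
    """Build an mpirun argument list, passing each MCA parameter by bare name."""
    argv = ["/usr/local/bin/mpirun"]
    for name, value in mca_params.items():
        argv += ["--mca", name, str(value)]
    argv += ["-np", str(np), "-H", ",".join(hosts), program]
    return argv


params = {
    "btl_tcp_port_min_v4": 36900,
    "btl_tcp_port_range_v4": 32,
    "btl_base_verbose": 30,
    "mpi_preconnect_all": 1,  # assumption: bare name, no OMPI_mca_ prefix
}
argv = mpirun_argv(params, ["10.11.14.203", "10.11.14.205"], "/tmp/hello")
print(shlex.join(argv))
```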
